CHAPTER 1


Formal Grammars and Languages 


In this chapter we introduce some basic notions and some notations we will use in 
the book. 

The set of natural numbers {0,1,2,...} is denoted by N. 

Given a set A, |A| denotes the cardinality of A, and $2^A$ denotes the powerset of A,
that is, the set of all subsets of A. Instead of $2^A$, we will also write Powerset(A).

We say that a set S is countable iff either S is finite or there exists a bijection 
between S and the set N of natural numbers. 


1.1. Free Monoids 


Let us consider a countable set V, also called an alphabet. The elements of V are
called symbols. The free monoid generated by the set V is the set, denoted $V^*$,
consisting of all finite sequences of symbols in V, that is,

$V^* = \{v_1 \ldots v_n \mid n \geq 0 \text{ and for } i = 1,\ldots,n,\ v_i \in V\}$.

The unary operation $^*$ (pronounced ‘star’) is called Kleene star (or Kleene closure,
or $^*$ closure). Sequences of symbols are also called words or strings. The length of
a sequence $v_1 \ldots v_n$ is n. The sequence of length 0 is called the empty sequence or
empty word and it is denoted by $\varepsilon$. The length of a sequence w is also denoted
by $|w|$.

Given two sequences $w_1$ and $w_2$ in $V^*$, their concatenation, denoted $w_1 \cdot w_2$ or
simply $w_1 w_2$, is the sequence in $V^*$ defined by recursion on the length of $w_1$, as
follows:

$w_1 \cdot w_2 = w_2$   if $w_1 = \varepsilon$
$w_1 \cdot w_2 = v_1 ((v_2 \ldots v_n) \cdot w_2)$   if $w_1 = v_1 v_2 \ldots v_n$ with $n > 0$.

We have that $|w_1 \cdot w_2| = |w_1| + |w_2|$. The concatenation operation $\cdot$ is associative
and its neutral element is the empty sequence $\varepsilon$.

Any set of sequences which is a subset of V* is called a language (or a formal 
language) over the alphabet V. 

Given two languages A and B, their concatenation, denoted $A \cdot B$, is defined as
follows:

$A \cdot B = \{w_1 \cdot w_2 \mid w_1 \in A \text{ and } w_2 \in B\}$.

Concatenation of languages is associative and its neutral element is the singleton $\{\varepsilon\}$.
When B is a singleton, say {w}, the concatenation $A \cdot B$ will also be written as $A \cdot w$
or simply Aw. Obviously, if $A = \emptyset$ or $B = \emptyset$ then $A \cdot B = \emptyset$.

We have that: $V^* = V^0 \cup V^1 \cup V^2 \cup \ldots \cup V^k \cup \ldots$, where for each $k \geq 0$, $V^k$ is
the set of all sequences of length k of symbols of V, that is,

$V^k = \{v_1 \ldots v_k \mid \text{for } i = 1,\ldots,k,\ v_i \in V\}$.

Obviously, $V^0 = \{\varepsilon\}$, $V^1 = V$, and for $h, k \geq 0$, $V^h \cdot V^k = V^{h+k}$. By $V^+$
we denote $V^* - \{\varepsilon\}$. The unary operation $^+$ (pronounced ‘plus’) is called positive
closure or $^+$ closure.

The set $V^0 \cup V^1$ is also denoted by $V^{0,1}$.
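As an illustration, the sets $V^k$ and the concatenation of languages can be sketched in Python for finite alphabets and finite languages. The function names `concat` and `power` are ours (they do not come from the text), and words are modelled as Python strings, with the empty string playing the role of the empty word.

```python
from itertools import product

def concat(A, B):
    """Concatenation A . B of two finite languages given as sets of strings."""
    return {w1 + w2 for w1 in A for w2 in B}

def power(V, k):
    """V^k: all sequences of length k of symbols of the alphabet V."""
    return {"".join(t) for t in product(sorted(V), repeat=k)}

V = {"a", "b"}
assert power(V, 0) == {""}                               # V^0 = {empty word}
assert power(V, 1) == V                                  # V^1 = V
assert concat(power(V, 2), power(V, 1)) == power(V, 3)   # V^h . V^k = V^(h+k)
assert concat(V, {""}) == V                              # {empty word} is neutral
assert concat(set(), V) == set()                         # empty language absorbs
```

The checks only cover one small alphabet, so they illustrate the identities rather than prove them.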

Given an element a in a set V, $a^*$ denotes the set of all finite sequences of zero
or more a’s (thus, $a^*$ is an abbreviation for $\{a\}^*$), $a^+$ denotes the set of all finite
sequences of one or more a’s (thus, $a^+$ is an abbreviation for $\{a\}^+$), $a^{0,1}$ denotes
the set $\{\varepsilon, a\}$ (thus, $a^{0,1}$ is an abbreviation for $\{a\}^{0,1}$), and $a^\omega$ denotes the infinite
sequence made out of all a’s.

Given a word w, for any $k \geq 0$, the prefix of w of length k, denoted $w_k$, is defined
as follows:

$w_k$ = if $|w| \leq k$ then w else u, where w = uv and $|u| = k$.

In particular, for any w, we have that: $w_0 = \varepsilon$ and $w_{|w|} = w$.
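The definition of the prefix translates directly into a short Python function. This is a minimal sketch in which words are Python strings and the name `prefix` is our own choice.

```python
def prefix(w, k):
    """Prefix of w of length k; the whole of w when w has at most k symbols."""
    if len(w) <= k:
        return w
    return w[:k]          # w = u v with |u| = k

print(prefix("abcde", 2))        # ab
print(prefix("ab", 5))           # ab
print(prefix("abc", 0) == "")    # True
```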


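Given a language $L \subseteq V^*$, we introduce the following notation:

(i) $L^0 = \{\varepsilon\}$
(ii) $L^1 = L$
(iii) $L^{n+1} = L \cdot L^n$
(iv) $L^* = \bigcup_{k \geq 0} L^k$
(v) $L^+ = \bigcup_{k > 0} L^k$
(vi) $L^{0,1} = L^0 \cup L^1$

We also have that $L^{n+1} = L^n \cdot L$ and, if $\varepsilon \notin L$, $L^+ = L^* - \{\varepsilon\}$.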
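For a finite language L the powers of L can be computed directly, and the star of L can be approximated by keeping only the words up to a given length. The following sketch does both. The names `lang_power` and `star_up_to` and the length bound are our assumptions, introduced only for illustration.

```python
def concat(A, B):
    """Concatenation of two finite languages given as sets of strings."""
    return {w1 + w2 for w1 in A for w2 in B}

def lang_power(L, n):
    """L^n, computed from L^0 = {''} and L^(n+1) = L . L^n."""
    result = {""}
    for _ in range(n):
        result = concat(L, result)
    return result

def star_up_to(L, max_len):
    """The words of the star of L whose length is at most max_len."""
    result = {""}
    frontier = {""}
    while frontier:
        longer = {w for w in concat(frontier, L) if len(w) <= max_len}
        frontier = longer - result      # keep only the words not seen before
        result |= frontier
    return result

L = {"a", "bb"}
print(sorted(lang_power(L, 2)))   # ['aa', 'abb', 'bba', 'bbbb']
print(sorted(star_up_to(L, 3)))   # ['', 'a', 'aa', 'aaa', 'abb', 'bb', 'bba']
```

The loop in `star_up_to` stops because only finitely many words over a finite alphabet have length at most `max_len`.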

The complement of a language L with respect to a set V* is the set V*—L. This 
set is also denoted by ~L when V* is understood from the context. The language 
operation ~ is called complementation. 

From now on, unless otherwise stated, when referring to an alphabet, we will 
assume that it is a finite set of symbols. 


1.2. Formal Grammars 
In this section we introduce the notion of a formal grammar. 
DEFINITION 1.2.1. [Formal Grammar] A formal grammar (or a grammar, for
short) is a 4-tuple $\langle V_T, V_N, P, S \rangle$, where:

(i) $V_T$ is a finite set of symbols, called terminal symbols,

(ii) $V_N$ is a finite set of symbols, called nonterminal symbols or variables, such that
$V_T \cap V_N = \emptyset$,

(iii) P is a finite set of pairs of strings, called productions, each pair $\langle \alpha, \beta \rangle$ being
denoted by $\alpha \to \beta$, where $\alpha \in V^+$ and $\beta \in V^*$, with $V = V_T \cup V_N$, and

(iv) S is an element of $V_N$, called axiom or start symbol.

The set $V_T$ is called the terminal alphabet. The elements of $V_T$ are usually denoted
by lower-case Latin letters such as a, b, ..., z. The set $V_N$ is called the nonterminal
alphabet. The elements of $V_N$ are usually denoted by upper-case Latin letters such
as A, B, ..., Z. In a production $\alpha \to \beta$, $\alpha$ is the left hand side (lhs, for short) and
$\beta$ is the right hand side (rhs, for short).


NOTATION 1.2.2. When presenting a grammar we will often indicate the set of
productions and the axiom only, because the sets $V_T$ and $V_N$ can be deduced from
the set of productions. The examples below will clarify this point. When writing
the set of productions we will feel free to group together the productions with the
same left hand side. For instance, we will write

$S \to A \mid a$
instead of

$S \to A$

$S \to a$
Sometimes we will also omit to indicate the axiom symbol when it is understood 
from the context. Unless otherwise indicated, the symbol S is assumed to be the 
axiom symbol. 


Given a grammar $G = \langle V_T, V_N, P, S \rangle$ we may define a set of elements in $V_T^*$,
called the language generated by G as we now indicate.

Let us first define the relation $\to_G \subseteq V^+ \times V^*$ as follows: for every sequence
$\alpha \in V^+$ and every sequence $\beta$, $\gamma$, and $\delta$ in $V^*$,

$\gamma \alpha \delta \to_G \gamma \beta \delta$ iff there exists a production $\alpha \to \beta$ in P.

For any $k \geq 0$, the k-fold composition of the relation $\to_G$ is denoted $\to_G^k$. Thus, for
instance, for every sequence $\sigma_0 \in V^+$ and every sequence $\sigma_2 \in V^*$, we have that:

$\sigma_0 \to_G^2 \sigma_2$ iff $\sigma_0 \to_G \sigma_1$ and $\sigma_1 \to_G \sigma_2$, for some $\sigma_1 \in V^+$.

The transitive closure of $\to_G$ is denoted $\to_G^+$. The reflexive, transitive closure of $\to_G$
is denoted $\to_G^*$. When it is understood from the context, we will feel free to omit
the subscript G, and instead of writing $\to_G$, $\to_G^k$, $\to_G^+$, and $\to_G^*$, we simply write
$\to$, $\to^k$, $\to^+$, and $\to^*$, respectively.


DEFINITION 1.2.3. [Language Generated by a Grammar] Given a grammar
$G = \langle V_T, V_N, P, S \rangle$, the language generated by G, denoted L(G), is the set

$L(G) = \{w \mid w \in V_T^* \text{ and } S \to_G^* w\}$.


The elements of the language L(G) are said to be words or strings generated by the 
grammar G. 
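The derivation relation and its k-fold composition can be made concrete with a few lines of Python. In this sketch, which is ours and not part of the text, every symbol is a single character, nonterminals are upper-case letters, a production is a pair of strings `(lhs, rhs)`, and the grammar with productions S → aSb | ab is a hypothetical example. The function `words_within` only collects the words reachable within a bounded number of steps, so in general it yields a subset of L(G).

```python
def step(form, productions):
    """All sentential forms obtainable from `form` by one derivation step."""
    result = set()
    for lhs, rhs in productions:
        pos = form.find(lhs)
        while pos != -1:
            # form = gamma lhs delta is rewritten into gamma rhs delta
            result.add(form[:pos] + rhs + form[pos + len(lhs):])
            pos = form.find(lhs, pos + 1)
    return result

def k_fold(form, productions, k):
    """All forms obtainable from `form` in exactly k derivation steps."""
    forms = {form}
    for _ in range(k):
        forms = {new for old in forms for new in step(old, productions)}
    return forms

def words_within(start, productions, max_steps):
    """Terminal words derivable from `start` in at most max_steps steps."""
    reached, forms = {start}, {start}
    for _ in range(max_steps):
        forms = {new for old in forms for new in step(old, productions)}
        reached |= forms
    return {w for w in reached if not any(c.isupper() for c in w)}

# Hypothetical grammar with productions S -> aSb | ab.
P = [("S", "aSb"), ("S", "ab")]
print(sorted(step("S", P)))              # ['aSb', 'ab']
print(sorted(k_fold("S", P, 2)))         # ['aaSbb', 'aabb']
print(sorted(words_within("S", P, 3)))   # ['aaabbb', 'aabb', 'ab']
```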


In what follows we will use the following notion. 


DEFINITION 1.2.4. [Language Generated by a Nonterminal Symbol of a
Grammar] Given a grammar $G = \langle V_T, V_N, P, S \rangle$, the language generated by the
nonterminal $A \in V_N$, denoted $L_G(A)$, is the set

$L_G(A) = \{w \mid w \in V_T^* \text{ and } A \to_G^* w\}$.

We will write L(A), instead of $L_G(A)$, when the grammar G is understood from the
context.


DEFINITION 1.2.5. [Equivalence of Grammars] Two grammars are said to 
be equivalent iff they generate the same language. 


Given a grammar $G = \langle V_T, V_N, P, S \rangle$, an element of $V^*$ is called a sentential
form of G.


The following fact is an immediate consequence of the definitions. 


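FACT 1.2.6. Given a grammar $G = \langle V_T, V_N, P, S \rangle$ and a word $w \in V_T^*$, we have
that w belongs to L(G) iff there exists a sequence $\langle \alpha_1, \ldots, \alpha_n \rangle$ of n ($\geq 1$) sentential
forms such that:

(i) $\alpha_1 = S$,
(ii) for every $i = 1, \ldots, n-1$, there exist $\gamma, \delta \in V^*$ such that $\alpha_i = \gamma \alpha \delta$, $\alpha_{i+1} = \gamma \beta \delta$,
and $\alpha \to \beta$ is a production in P, and
(iii) $\alpha_n = w$.
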
Let us now introduce the following concepts. 


DEFINITION 1.2.7. [Derivation of a Word and Derivation of a Sentential
Form] Given a grammar $G = \langle V_T, V_N, P, S \rangle$ and a word $w \in V_T^*$ in L(G), any
sequence $\langle \alpha_1, \alpha_2, \ldots, \alpha_{n-1}, \alpha_n \rangle$ of n ($\geq 1$) sentential forms satisfying Conditions (i),
(ii), and (iii) of Fact 1.2.6 above, is called a derivation of w from S in the grammar
G. A derivation $\langle S, \alpha_2, \ldots, \alpha_{n-1}, w \rangle$ is also written as:

$S \to \alpha_2 \to \ldots \to \alpha_{n-1} \to w$   or as:   $S \to^* w$.

More generally, a derivation of a sentential form $\varphi \in V^*$ from S in the grammar
G is any sequence $\langle \alpha_1, \alpha_2, \ldots, \alpha_{n-1}, \alpha_n \rangle$ of n ($\geq 1$) sentential forms such that
Conditions (i) and (ii) of Fact 1.2.6 hold, and $\alpha_n = \varphi$. That derivation is also written
as:

$S \to \alpha_2 \to \ldots \to \alpha_{n-1} \to \varphi$   or as:   $S \to^* \varphi$.


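DEFINITION 1.2.8. [Derivation Step] Given a derivation $\langle \alpha_1, \ldots, \alpha_n \rangle$ of n ($\geq 1$)
sentential forms, for any $i = 1, \ldots, n-1$, the pair $\langle \alpha_i, \alpha_{i+1} \rangle$ is called a derivation step
from $\alpha_i$ to $\alpha_{i+1}$ (or a rewriting step from $\alpha_i$ to $\alpha_{i+1}$). A derivation step $\langle \alpha_i, \alpha_{i+1} \rangle$ is
also denoted by $\alpha_i \to \alpha_{i+1}$.

Given a sentential form $\gamma \alpha \delta$ for some $\gamma, \delta \in V^*$ and $\alpha \in V^+$, if we apply the
production $\alpha \to \beta$, we perform the derivation step $\gamma \alpha \delta \to \gamma \beta \delta$.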


Given a grammar G and a word w € L(G) the derivation of w from S may not 
be unique as indicated by the following example. 


EXAMPLE 1.2.9. For instance, given the grammar
$\langle \{a\}, \{S, A\}, \{S \to a \mid A,\ A \to a\}, S \rangle$,
we have the following two derivations for the word a from S:
(i) $S \to a$
(ii) $S \to A \to a$   □
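The two derivations of this example can be found mechanically by exploring all derivation steps from S. The sketch below is ours: symbols are single characters, productions are pairs of strings, and the bound `max_steps` is an assumption that keeps the search finite.

```python
def step(form, productions):
    """All sentential forms obtainable from `form` by one derivation step."""
    result = set()
    for lhs, rhs in productions:
        pos = form.find(lhs)
        while pos != -1:
            result.add(form[:pos] + rhs + form[pos + len(lhs):])
            pos = form.find(lhs, pos + 1)
    return result

def derivations(start, word, productions, max_steps):
    """All derivations of `word` from `start` that use at most max_steps steps."""
    found = []

    def extend(path):
        if path[-1] == word:
            found.append(tuple(path))
            return
        if len(path) - 1 == max_steps:
            return
        for nxt in sorted(step(path[-1], productions)):
            extend(path + [nxt])

    extend([start])
    return found

# Productions of Example 1.2.9: S -> a | A and A -> a.
P = [("S", "a"), ("S", "A"), ("A", "a")]
for d in derivations("S", "a", P, max_steps=3):
    print(" -> ".join(d))
# S -> A -> a
# S -> a
```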
1.3. The Chomsky Hierarchy


There are four types of formal grammars which constitute the so-called Chomsky
Hierarchy, named after the American linguist Noam Chomsky. Let $V_T$ denote the
alphabet of the terminal symbols, $V_N$ denote the alphabet of the nonterminal symbols,
and V be $V_T \cup V_N$.


DEFINITION 1.3.1. [Type 0, 1, 2, and 3 Production, Grammar, and Language.
Version 1] (i) Every production $\alpha \to \beta$ with $\alpha \in V^+$ and $\beta \in V^*$, is a
type 0 production.

(ii) A production $\alpha \to \beta$ is of type 1 iff $\alpha, \beta \in V^+$ and the length of $\alpha$ is not greater
than the length of $\beta$.

(iii) A production $\alpha \to \beta$ is of type 2 iff $\alpha \in V_N$ and $\beta \in V^+$.

(iv) A production $\alpha \to \beta$ is of type 3 iff $\alpha \in V_N$ and $\beta \in V_T \cup V_T V_N$.

For i = 0, 1, 2, 3, a grammar is of type i if all its productions are of type i. For
i = 0, 1, 2, 3, a language is of type i if it is generated by a type i grammar.
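The four conditions of Definition 1.3.1 can be checked mechanically. The following sketch, with function names of our own choosing, represents symbols as single characters and productions as pairs of strings, and returns the largest i for which a production is of type i. The two grammars used in the example are the ones given later in this section for the languages $\{a^n b^n c^n \mid n > 0\}$ and for the words without two consecutive 1’s.

```python
def production_type(lhs, rhs, terminals, nonterminals):
    """Largest i such that lhs -> rhs is a type i production, or None."""
    symbols = terminals | nonterminals
    if not lhs or any(s not in symbols for s in lhs + rhs):
        return None                       # lhs must be in V+ and rhs in V*
    if lhs in nonterminals and rhs:       # single nonterminal, rhs in V+
        if rhs in terminals:
            return 3                      # A -> a
        if len(rhs) == 2 and rhs[0] in terminals and rhs[1] in nonterminals:
            return 3                      # A -> aB
        return 2
    if rhs and len(lhs) <= len(rhs):
        return 1
    return 0

def grammar_type(productions, terminals, nonterminals):
    """Largest i such that all the given (valid) productions are of type i."""
    return min(production_type(l, r, terminals, nonterminals)
               for l, r in productions)

P1 = [("S", "aSBC"), ("S", "aBC"), ("CB", "BC"), ("aB", "ab"),
      ("bB", "bb"), ("bC", "bc"), ("cC", "cc")]
P3 = [("S", "0A"), ("S", "1B"), ("S", "0"), ("S", "1"),
      ("A", "0A"), ("A", "1B"), ("A", "0"), ("A", "1"),
      ("B", "0A"), ("B", "0")]
print(grammar_type(P1, set("abc"), set("SBC")))   # 1
print(grammar_type(P3, set("01"), set("SAB")))    # 3
```

Since every type i+1 production is also of type i, returning the largest i loses no information.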


REMARK 1.3.2. Note that in Definition 1.5.7 on page 21, we will slightly generalize
the above notions of type 1, 2, and 3 grammars and languages. In these
generalized notions we will allow the generation of the empty word $\varepsilon$.


A production of the form $A \to \beta$, with $A \in V_N$ and $\beta \in V^*$, is said to be a
production for (or of) the nonterminal symbol A.


It follows from Definition 1.3.1 that for i = 0,1,2, a type i + 1 grammar is also 
a type i grammar. Thus, the four types of grammars we have defined, constitute a 
hierarchy which is called the Chomsky Hierarchy. 

Actually, this hierarchy is a proper hierarchy in the sense that there exists a
grammar of type i which generates a language which cannot be generated by any
grammar of type i+1, for i = 0, 1, 2.


As a consequence of the following Theorem 1.3.4, the class of type 1 languages 
coincides with the class of context-sensitive languages in the sense specified by the 
following definition. 


DEFINITION 1.3.3. [Context-Sensitive Production, Grammar, and Language.
Version 1] Given a grammar $G = \langle V_T, V_N, P, S \rangle$, a production in P is
context-sensitive if it is of the form $u A v \to u w v$, where $u, v \in V^*$, $A \in V_N$,
and $w \in V^+$. A grammar is a context-sensitive grammar if all its productions are
context-sensitive productions. A language is context-sensitive if it is generated by a
context-sensitive grammar.


THEOREM 1.3.4. [Equivalence Between Type 1 Grammars and Context-Sensitive
Grammars] (i) For every type 1 grammar there exists an equivalent
context-sensitive grammar. (ii) For every context-sensitive grammar there exists an
equivalent type 1 grammar.

The proof of this theorem is postponed to Chapter 4 (see Theorem 4.0.3 on page 172)
and it will be given in a slightly more general setting where we will allow the production
$S \to \varepsilon$ to occur in type 1 grammars (as usual, S denotes the axiom of the grammar).


As a consequence of Theorem 1.3.4, instead of saying ‘type 1 languages’, we 
will say ‘context-sensitive languages’. For productions, grammars, and languages, 
instead of saying that they are ‘of type 0’, we will also say that they are ‘unrestricted’. 
Similarly, 

- instead of saying ‘of type 2’, we will also say ‘context-free’, and 
- instead of saying ‘of type 3’, we will also say ‘regular’. 

Due to their form, type 3 grammars are also called right linear grammars, or
right recursive type 3 grammars.


One can show that every type 3 language can also be generated by a grammar
whose productions are of the form $\alpha \to \beta$, where $\alpha \in V_N$ and $\beta \in V_T \cup V_N V_T$.
Grammars whose productions are of that form are called left linear grammars or
left recursive type 3 grammars. The proof of that fact is postponed to Section 2.4
and it will be given in a slightly more general setting where we allow the production
$S \to \varepsilon$ to occur in right linear and left linear grammars (see Theorem 2.4.3 on
page 40).


Now let us present some examples of languages and grammars. 

The language $L_0 = \{\varepsilon, a\}$ is generated by the type 0 grammar whose axiom is S
and whose productions are:

$S \to a \mid \varepsilon$

The set of terminal symbols is {a} and the set of nonterminal symbols is {S}. The
language $L_0$ cannot be generated by a type 1 grammar, because for generating the
word $\varepsilon$ we need a production whose right hand side has a length smaller than the
length of the corresponding left hand side.


The language $L_1 = \{a^n b^n c^n \mid n > 0\}$ is generated by the type 1 grammar whose
axiom is S and whose productions are:

$S \to aSBC \mid aBC$
$CB \to BC$
$aB \to ab$
$bB \to bb$
$bC \to bc$
$cC \to cc$

The set of terminal symbols is {a, b, c} and the set of nonterminal symbols is {S,
B, C}. The language $L_1$ cannot be generated by a context-free grammar. This fact
will be shown later (see Corollary 3.11.2 on page 152).
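As a worked instance, the word aabbcc (the case n = 2) can be derived in this grammar as follows, where the production applied at each step is shown on the right. This particular derivation is our illustration. Other orders of application are possible.

```latex
\begin{align*}
S &\to aSBC   && (S \to aSBC) \\
  &\to aaBCBC && (S \to aBC)  \\
  &\to aaBBCC && (CB \to BC)  \\
  &\to aabBCC && (aB \to ab)  \\
  &\to aabbCC && (bB \to bb)  \\
  &\to aabbcC && (bC \to bc)  \\
  &\to aabbcc && (cC \to cc)
\end{align*}
```

Note that the production $CB \to BC$ must be applied before the symbols are turned into terminals: if a C becomes c while a B still occurs to its right, that B can no longer be rewritten.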


The language
$L_2 = \{w \mid w \in \{0,1\}^+$ and
the number of 0’s in w is equal to the number of 1’s in w$\}$
is generated by the context-free grammar whose axiom is S and whose productions
are:

$S \to 0\,S_1 \mid 1\,S_0$

$S_0 \to 0 \mid 0\,S \mid 1\,S_0 S_0$

$S_1 \to 1 \mid 1\,S \mid 0\,S_1 S_1$


The set of terminal symbols is {0, 1} and the set of nonterminal symbols is $\{S, S_0, S_1\}$.
The language $L_2$ cannot be generated by a regular grammar. This fact will be shown
later and, indeed, it is a consequence of Corollary 2.9.2 on page 73.
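The claim that this grammar generates exactly the nonempty words with as many 0’s as 1’s can be checked for short words by exhaustive search. The sketch below is ours. It renames the nonterminals $S_0$ and $S_1$ to the single letters Z and U (an arbitrary choice) so that every symbol is one character, and it prunes sentential forms longer than `max_len`, which loses nothing here because no production of this grammar shortens a sentential form.

```python
from collections import deque
from itertools import product

PRODUCTIONS = [("S", "0U"), ("S", "1Z"),
               ("Z", "0"), ("Z", "0S"), ("Z", "1ZZ"),
               ("U", "1"), ("U", "1S"), ("U", "0UU")]

def generated_words(productions, start, max_len):
    """Words of length <= max_len derivable from start (breadth-first search)."""
    seen, queue, words = {start}, deque([start]), set()
    while queue:
        form = queue.popleft()
        if not any(c.isupper() for c in form):
            words.add(form)               # only terminal symbols are left
            continue
        for lhs, rhs in productions:
            pos = form.find(lhs)
            while pos != -1:
                new = form[:pos] + rhs + form[pos + len(lhs):]
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
                pos = form.find(lhs, pos + 1)
    return words

def balanced_words(max_len):
    """Nonempty words over {0,1} of length <= max_len with as many 0's as 1's."""
    return {"".join(t)
            for n in range(1, max_len + 1)
            for t in product("01", repeat=n)
            if t.count("0") == t.count("1")}

print(generated_words(PRODUCTIONS, "S", 6) == balanced_words(6))   # True
```

Such a bounded check is evidence for short words only, not a proof of the general statement.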


The language
$L_3 = \{w \mid w \in \{0,1\}^+$ and w does not contain two consecutive 1’s$\}$
is generated by the regular grammar whose axiom is S and whose productions are:

$S \to 0A \mid 1B \mid 0 \mid 1$
$A \to 0A \mid 1B \mid 0 \mid 1$
$B \to 0A \mid 0$

The set of terminal symbols is {0, 1} and the set of nonterminal symbols is {S, A, B}.
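Because every production of this grammar has the form A → aB or A → a, membership of a word can be decided by reading the word from left to right while keeping track of the nonterminals a derivation may currently be in. The following sketch is ours (the function name `generates` and the dictionary layout are assumptions), and it anticipates the correspondence between regular grammars and finite automata.

```python
# Productions of the grammar above, grouped by left hand side.
PRODUCTIONS = {"S": ["0A", "1B", "0", "1"],
               "A": ["0A", "1B", "0", "1"],
               "B": ["0A", "0"]}

def generates(productions, start, word):
    """Decide whether a right linear grammar generates `word`."""
    if not word:
        return False                  # no production has an empty right hand side
    current = {start}
    for ch in word[:-1]:
        # a production A -> aB emits the symbol a and moves on to B
        current = {rhs[1] for A in current for rhs in productions[A]
                   if len(rhs) == 2 and rhs[0] == ch}
    # the last symbol must be emitted by a production A -> a
    return any(rhs == word[-1] for A in current for rhs in productions[A])

print(generates(PRODUCTIONS, "S", "10100"))   # True
print(generates(PRODUCTIONS, "S", "0110"))    # False
```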


Since for i = 0, 1, 2 there are type i languages which are not type i+1 languages,
we have that the set of type i languages properly includes the set of type i+1
languages.


Note that if we allow productions of the form $\alpha \to \beta$, where $\alpha \in V^*$ and $\beta \in V^*$,
we do not extend the generative power of formal grammars in the sense specified by
the following theorem.


THEOREM 1.3.5. For every grammar whose productions are of the form $\alpha \to \beta$,
where $\alpha \in V^*$ and $\beta \in V^*$, there exists an equivalent grammar whose productions
are of the form $\alpha \to \beta$, where $\alpha \in V^+$ and $\beta \in V^*$.


PROOF. Without loss of generality, let us consider a grammar $G = \langle V_T, V_N, P, S \rangle$
with a single production of the form $\varepsilon \to \beta$, where $\varepsilon$ is the empty string and $\beta \in V^*$.
Let us consider the set of productions $Q = \{E \to \beta\} \cup \{x \to Ex,\ x \to xE \mid x \in V\}$
where E is a new nonterminal symbol not in $V_N$.

Now we claim that the type 0 grammar $H = \langle V_T, V_N \cup \{E\}, (P - \{\varepsilon \to \beta\}) \cup Q, S \rangle$
is equivalent to G. Indeed, we show that:

(i) $L(G) \subseteq L(H)$, and
(ii) $L(H) \subseteq L(G)$.

Let us first assume that $\varepsilon \notin L(G)$. Property (i) holds because, given a derivation
$S \to_G^* w$ for some word w, where in a particular derivation step we used the
production $\varepsilon \to_G \beta$, then in order to simulate that derivation step, we can use either
the production $x \to_H Ex$ or the production $x \to_H xE$ followed by $E \to_H \beta$.
Property (ii) holds because, given a derivation $S \to_H^* w$ for some word w, where
in a particular step we used the production $x \to_H Ex$ or $x \to_H xE$, then in order
to get a string of terminal symbols only, we need to apply the production $E \to_H \beta$
and the sentential form derived by applying $E \to_H \beta$, can also be obtained in the
grammar G by applying $\varepsilon \to_G \beta$.

If $\varepsilon \in L(G)$ then we can prove Properties (i) and (ii) as in the case when
$\varepsilon \notin L(G)$, because the derivation $S \to_G^* \varepsilon \to_G \beta$ can be simulated by the derivation

$S \to_H SE \to_H^* E \to_H \beta$.

THEOREM 1.3.6. [Start Symbol Not on the Right Hand Side of Productions]
For i = 0, 1, 2, 3, we can transform every type i grammar G into an equivalent
type i grammar H whose axiom occurs only on the left hand side of the productions.


PROOF. In order to get the grammar H, for any grammar G of type 0, or 1, or 2,
it is enough to add to the grammar G a new start symbol $S'$ and then add the new
production $S' \to S$. If the grammar G is of type 3, we do as follows. We consider
the set of productions of G whose left hand side is the axiom S. Call it $P_S$. Then
we add a new axiom symbol $S'$ and the new productions $\{S' \to \beta_i \mid S \to \beta_i \in P_S\}$.
It is easy to see that L(G) = L(H). □
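The construction in this proof is short enough to be written down directly. In the following sketch, which is ours, productions are pairs of strings, the flag `regular` selects the type 3 case, and the new axiom is assumed to be a symbol that does not occur in the given grammar. For type 3 grammars the single production $S' \to S$ cannot be used, because its right hand side is a lone nonterminal, which is why the productions for S are copied instead.

```python
def with_fresh_axiom(productions, axiom, new_axiom, regular=False):
    """Return (productions', new_axiom); new_axiom occurs on no right hand side.

    new_axiom is assumed not to occur anywhere in the given grammar.
    """
    if regular:
        # type 3: copy every production S -> beta as S' -> beta
        added = [(new_axiom, rhs) for lhs, rhs in productions if lhs == axiom]
    else:
        # types 0, 1, and 2: the single production S' -> S suffices
        added = [(new_axiom, axiom)]
    return productions + added, new_axiom

# Hypothetical regular grammar S -> aS | b, with T playing the role of S'.
P = [("S", "aS"), ("S", "b")]
new_P, new_axiom = with_fresh_axiom(P, "S", "T", regular=True)
print(new_P)   # [('S', 'aS'), ('S', 'b'), ('T', 'aS'), ('T', 'b')]
print(all(new_axiom not in rhs for _, rhs in new_P))   # True
```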


DEFINITION 1.3.7. [Grammar in Separated Form] A grammar is said to be
in separated form iff every production is of one of the following three forms, where
$u, v \in V_N^+$, $A \in V_N$, and $a \in V_T$:

(i) $u \to v$
(ii) $A \to a$
(iii) $A \to \varepsilon$


THEOREM 1.3.8. [Separated Form Theorem] For every grammar G there
exists an equivalent grammar H in separated form such that there is at most one
production of H of the form $A \to \varepsilon$ where A is a nonterminal symbol. Thus, if
$\varepsilon \in L(G)$ then every derivation of $\varepsilon$ from S is of the form $S \to^* A \to \varepsilon$.


PROOF. We first prove that the theorem holds without the condition that there
is at most one production of the form $A \to \varepsilon$. The productions of the grammar H
are obtained as follows:

(i) for every terminal a in G we introduce a new nonterminal symbol A and the
production $A \to a$ and replace every occurrence of the terminal a, in either the left
hand side or the right hand side of a production of G, by A, and

(ii) replace every production $u \to \varepsilon$, where |u| > 1, by $u \to C$ and $C \to \varepsilon$, where C
is a new nonterminal symbol.

We leave it to the reader to check that the new grammar H is equivalent to the
grammar G.

Now we prove that for every grammar H obtained as indicated above, we can
produce an equivalent grammar $H'$ with at most one production of the form $A \to \varepsilon$.
Indeed, consider the set $\{A_i \to \varepsilon \mid i \in I\}$ of all productions of the grammar H whose
right hand side is $\varepsilon$. The equivalent grammar $H'$ is obtained by replacing that set
by the new set $\{A_i \to B \mid i \in I\} \cup \{B \to \varepsilon\}$, where B is a new nonterminal symbol.
We leave it to the reader to check that the new grammar $H'$ is equivalent to the
grammar H. □
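Steps (i) and (ii) of this proof can be sketched as a small program. The representation is our assumption: a symbol is a string, each side of a production is a tuple of symbols, and the fresh nonterminals are named `N_a` and `C_1`, `C_2`, ..., which are assumed not to clash with existing symbols. The final merging of the productions with empty right hand side is not included.

```python
def separated_form(productions, terminals):
    """Apply steps (i) and (ii) of the construction to a list of productions."""
    proxy = {a: "N_" + a for a in terminals}         # assumed fresh names
    result, fresh = [], 0
    for lhs, rhs in productions:
        lhs = tuple(proxy.get(s, s) for s in lhs)    # step (i): hide terminals
        rhs = tuple(proxy.get(s, s) for s in rhs)
        if not rhs and len(lhs) > 1:                 # step (ii): u -> empty word
            fresh += 1
            c = "C_%d" % fresh
            result += [(lhs, (c,)), ((c,), ())]
        else:
            result.append((lhs, rhs))
    result += [((proxy[a],), (a,)) for a in sorted(terminals)]   # A -> a
    return result

def is_separated(productions, terminals):
    """Check that every production is u -> v, A -> a, or A -> empty word."""
    def ok(lhs, rhs):
        if not lhs or any(s in terminals for s in lhs):
            return False
        if len(lhs) == 1 and (rhs == () or (len(rhs) == 1 and rhs[0] in terminals)):
            return True
        return len(rhs) > 0 and not any(s in terminals for s in rhs)
    return all(ok(l, r) for l, r in productions)

# The type 1 grammar for a^n b^n c^n given earlier in this section.
T = {"a", "b", "c"}
P = [(tuple(l), tuple(r)) for l, r in
     [("S", "aSBC"), ("S", "aBC"), ("CB", "BC"), ("aB", "ab"),
      ("bB", "bb"), ("bC", "bc"), ("cC", "cc")]]
H = separated_form(P, T)
print(is_separated(P, T), is_separated(H, T))   # False True
```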
DEFINITION 1.3.9. [Kuroda Normal Form] A context-sensitive grammar is
said to be in Kuroda normal form iff every production is of one of the following
forms, where $A, B, C \in V_N$ and $a \in V_T$:

(i) $A \to BC$
(ii) $AB \to AC$ (the left context A is preserved)
(iii) $AB \to CB$ (the right context B is preserved)
(iv) $A \to B$
(v) $A \to a$

In order to prove the following theorem we now introduce the notion of the order 
of a production and the order of a grammar. 


DEFINITION 1.3.10. [Order of a Production and Order of a Grammar]
We say that the order of a production $u \to v$ is n iff n is the maximum between |u|
and |v|. We say that the order of a grammar G is n iff n is the maximum order of
a production in G.


We have that the order of a production (and of a grammar) is at least 1. 


THEOREM 1.3.11. [Kuroda Theorem] For every context-sensitive grammar 
there exists an equivalent context-sensitive grammar in Kuroda normal form. 


PROOF. Let G be the given context-sensitive grammar and let $G_S$ be a grammar
which is equivalent to G and is in separated form. For every production $u \to v$
of the grammar $G_S$ which is not of the form (v), we have that $|u| \leq |v|$ because the
given grammar is context-sensitive.

Now, given any production of $G_S$ of order n > 2 we can derive a new equivalent
grammar where that production has been replaced by a set of productions, each of
which is of order strictly less than n. We have that every production $u \to v$ of $G_S$
of order n > 2 can be of one of the following two forms:

(i) $u = P_1 P_2 \alpha$ and $v = Q_1 Q_2 \beta$, where $P_1$, $P_2$, $Q_1$, and $Q_2$ are nonterminal symbols
and $\alpha \in V_N^*$ and $\beta \in V_N^+$, and
(ii) $u = P_1$ and $v = Q_1 Q_2 \beta$, where $P_1$, $Q_1$, and $Q_2$ are nonterminal symbols and
$\beta \in V_N^+$.

In Case (i) we replace the production $u \to v$ by the productions:
$P_1 P_2 \to T_1 T_2$
$T_1 \to Q_1$
$T_2 \alpha \to Q_2 \beta$
where $T_1$ and $T_2$ are new nonterminal symbols.

In Case (ii) we replace the production $u \to v$ by the productions:
$P_1 \to T_1 T_2$
$T_1 \to Q_1$
$T_2 \to Q_2 \beta$
where $T_1$ and $T_2$ are new nonterminal symbols.

Thus, by iterating the transformations of Cases (i) and (ii), we eventually get
an equivalent grammar whose productions are all of order at most 2. A type 1
production of order at most 2 can be of one of the following five forms:
(1) $A \to B$ which is of the form (iv) of Definition 1.3.9,
(2) $A \to BC$ which is of the form (i) of Definition 1.3.9,
(3) $AB \to AC$ which is of the form (ii) of Definition 1.3.9,
(4) $AB \to CB$ which is of the form (iii) of Definition 1.3.9,
(5) $AB \to CD$ and this production can be replaced by the productions:

$AB \to AT$ (which is of the form (ii) of Definition 1.3.9),

$AT \to CT$ (which is of the form (iii) of Definition 1.3.9), and

$CT \to CD$ (which is of the form (ii) of Definition 1.3.9),
where T is a new nonterminal symbol.

We leave it to the reader to check that after all the above transformations the
derived grammar is equivalent to $G_S$ and, thus, to G. □


There is a stronger form of the Kuroda Theorem because one can show that 
the productions of the forms (ii) and (iv) (or, by symmetry, those of the forms (iii) 
and (iv)) are not needed. 


EXAMPLE 1.3.12. We can replace the production $ABCD \to RSTUV$ whose
order is 5, by the following three productions, whose order is at most 4:

$AB \to T_1 T_2$

$T_1 \to R$

$T_2 CD \to STUV$

where $T_1$ and $T_2$ are new nonterminal symbols. By this replacement the grammar
where the production $ABCD \to RSTUV$ occurs, is transformed into a new,
equivalent grammar.

Note that we can replace the production $ABCD \to RSTUV$ also by the following
two productions, whose order is at most 4:

$AB \to R T_2$

$T_2 CD \to STUV$

where $T_2$ is a new nonterminal symbol. Also by this replacement, although it
does not follow the rules indicated in the proof of the Kuroda Theorem, we get
a new grammar which is equivalent to the grammar with the production
$ABCD \to RSTUV$. □
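The replacement rules of Cases (i) and (ii) in the proof of the Kuroda Theorem can be sketched as a function that splits one production of order greater than 2. The representation is our assumption: each side of a production is a tuple of symbol names, and the names passed for the two new nonterminals are assumed to be fresh.

```python
def order(lhs, rhs):
    """Order of the production lhs -> rhs."""
    return max(len(lhs), len(rhs))

def split_production(lhs, rhs, t1, t2):
    """Replace lhs -> rhs (order > 2, |lhs| <= |rhs|) by three productions.

    t1 and t2 are assumed to be new nonterminal symbols.
    """
    if len(lhs) >= 2:    # Case (i):  P1 P2 alpha -> Q1 Q2 beta
        return [(lhs[:2], (t1, t2)),
                ((t1,), rhs[:1]),
                ((t2,) + lhs[2:], rhs[1:])]
    # Case (ii): P1 -> Q1 Q2 beta
    return [(lhs, (t1, t2)), ((t1,), rhs[:1]), ((t2,), rhs[1:])]

parts = split_production(tuple("ABCD"), tuple("RSTUV"), "T1", "T2")
for l, r in parts:
    print(" ".join(l), "->", " ".join(r))
# A B -> T1 T2
# T1 -> R
# T2 C D -> S T U V
print(max(order(l, r) for l, r in parts))   # 4
```

Applying the function again to any resulting production of order greater than 2 (with further fresh symbols) eventually yields productions of order at most 2.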


With reference to the proof of the Kuroda Theorem (see Theorem 1.3.11), note that
if we replace the production $AB \to CD$ by the two productions: $AB \to AD$ and
$AD \to CD$, we may get a grammar which is not equivalent to the given one. Indeed,
consider, for instance, the grammar G whose productions are:

$S \to AB$

$AB \to CD$

$CD \to aa$

$AD \to bb$

We have that L(G) = {aa}. However, for the grammar $G'$ whose productions are:

$S \to AB$
$AB \to AD$
$AD \to CD$
$CD \to aa$
$AD \to bb$

we have that $L(G') = \{aa, bb\}$.
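The two languages of this counterexample can be computed by exhaustive search over the sentential forms. The sketch below is ours: symbols are single characters, productions are pairs of strings, and forms longer than `max_len` are discarded, which is harmless here because no production shortens a sentential form and no form of either grammar is longer than 2.

```python
from collections import deque

def generated_words(productions, start, max_len):
    """Words of length <= max_len derivable from start (breadth-first search)."""
    seen, queue, words = {start}, deque([start]), set()
    while queue:
        form = queue.popleft()
        if not any(c.isupper() for c in form):
            words.add(form)               # only terminal symbols are left
            continue
        for lhs, rhs in productions:
            pos = form.find(lhs)
            while pos != -1:
                new = form[:pos] + rhs + form[pos + len(lhs):]
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
                pos = form.find(lhs, pos + 1)
    return words

G = [("S", "AB"), ("AB", "CD"), ("CD", "aa"), ("AD", "bb")]
G_PRIME = [("S", "AB"), ("AB", "AD"), ("AD", "CD"), ("CD", "aa"), ("AD", "bb")]
print(sorted(generated_words(G, "S", 2)))         # ['aa']
print(sorted(generated_words(G_PRIME, "S", 2)))   # ['aa', 'bb']
```

The difference arises because the intermediate form AD, which is unreachable in G, becomes reachable once $AB \to AD$ is added.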


1.4. Chomsky Normal Form and Greibach Normal Form 
Context-free grammars can be put into normal forms as we now indicate. 


DEFINITION 1.4.1. [Chomsky Normal Form. Version 1] A context-free
grammar $G = \langle V_T, V_N, P, S \rangle$ is said to be in Chomsky normal form iff every production
is of one of the following two forms, where $A, B, C \in V_N$ and $a \in V_T$:

(i) $A \to BC$
(ii) $A \to a$


This definition of the Chomsky normal form can be extended to the case when in
the set P of productions we allow $\varepsilon$-productions, that is, productions whose right
hand side is the empty word $\varepsilon$ (see Section 1.5). That extended definition will be
introduced later (see Definition 3.6.1 on page 131).

Note that by Theorem 1.3.6 on page 16, we may assume without loss of generality,
that the axiom S does not occur on the right hand side of any production.


THEOREM 1.4.2. [Chomsky Theorem. Version 1] For every context-free
grammar there exists an equivalent context-free grammar in Chomsky normal form.


The proof of this theorem will be given later (see Theorem 3.6.2 on page 131). 


DEFINITION 1.4.3. [Greibach Normal Form. Version 1] A context-free
grammar $G = \langle V_T, V_N, P, S \rangle$ is said to be in Greibach normal form iff every production
is of the following form, where $A \in V_N$, $a \in V_T$, and $\alpha \in V_N^*$:

$A \to a \alpha$


As in the case of the Chomsky normal form, also this definition of the Greibach
normal form can be extended to the case when in the set P of productions we allow
$\varepsilon$-productions (see Section 1.5). That extended definition will be given later (see
Definition 3.7.1 on page 133).


Also in the case of the Greibach normal form, by Theorem 1.3.6 on page 16 we
may assume without loss of generality, that the axiom S does not occur on the right
hand side of any production, that is, $\alpha \in (V_N - \{S\})^*$.


THEOREM 1.4.4. [Greibach Theorem. Version 1] For every context-free 
grammar there exists an equivalent context-free grammar in Greibach normal form. 


The proof of this theorem will be given later (see Theorem 3.7.2 on page 133). 
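Whether a given context-free grammar is in one of the two normal forms of this section (Version 1, without epsilon productions) is a purely syntactic property and can be checked as follows. The sketch is ours: symbols are single characters, productions are pairs of strings, and the two small grammars for the language $\{a^n b^n \mid n > 0\}$ are hypothetical examples that do not come from the text.

```python
def is_chomsky(productions, terminals, nonterminals):
    """Every production is of the form A -> BC or A -> a."""
    def ok(lhs, rhs):
        if lhs not in nonterminals:
            return False
        return rhs in terminals or (
            len(rhs) == 2 and all(s in nonterminals for s in rhs))
    return all(ok(l, r) for l, r in productions)

def is_greibach(productions, terminals, nonterminals):
    """Every production is A -> a alpha, alpha a string of nonterminals."""
    def ok(lhs, rhs):
        if lhs not in nonterminals or not rhs:
            return False
        return rhs[0] in terminals and all(s in nonterminals for s in rhs[1:])
    return all(ok(l, r) for l, r in productions)

T = {"a", "b"}
# Hypothetical grammar in Chomsky normal form: S -> AB | AC, C -> SB, A -> a, B -> b
CNF = [("S", "AB"), ("S", "AC"), ("C", "SB"), ("A", "a"), ("B", "b")]
# Hypothetical grammar in Greibach normal form: S -> aSB | aB, B -> b
GNF = [("S", "aSB"), ("S", "aB"), ("B", "b")]
print(is_chomsky(CNF, T, set("SABC")), is_greibach(CNF, T, set("SABC")))   # True False
print(is_chomsky(GNF, T, set("SB")), is_greibach(GNF, T, set("SB")))       # False True
```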
1.5. Epsilon Productions
Let us introduce the following concepts.


DEFINITION 1.5.1. [Epsilon Production] Given a grammar $G = \langle V_T, V_N, P, S \rangle$
a production of the form $A \to \varepsilon$, where $A \in V_N$, is called an epsilon production.


Instead of writing ‘epsilon productions’, we will feel free to write ‘$\varepsilon$-productions’.


DEFINITION 1.5.2. [Extended Grammar] For i = 0, 1, 2, 3, an extended type i
grammar is a grammar $\langle V_T, V_N, P, S \rangle$ whose set of productions P consists of
productions of type i and, possibly, n ($\geq 1$) epsilon productions of the form: $A_1 \to \varepsilon$,
..., $A_n \to \varepsilon$, where the $A_i$’s are distinct nonterminal symbols.


DEFINITION 1.5.3. [S-extended Grammar] For i = 0, 1, 2, 3, an S-extended
type i grammar is a grammar $\langle V_T, V_N, P, S \rangle$ whose set of productions P consists of
productions of type i and, possibly, the production $S \to \varepsilon$.


Obviously, an S-extended grammar is also an extended grammar of the same 
type. 

We have that every extended type 1 grammar is equivalent to an extended
context-sensitive grammar, that is, a context-sensitive grammar whose set of
productions includes, for some $n \geq 0$, n epsilon productions of the form: $A_1 \to \varepsilon$, ...,
$A_n \to \varepsilon$, where for $i = 1, \ldots, n$, $A_i \in V_N$.

This property follows from the fact that, as indicated in the proof of
Theorem 4.0.3 on page 172 (which generalizes Theorem 1.3.4 on page 13), the equivalence
between type 1 grammars and context-sensitive grammars, is based on the
transformation of a single type 1 production into n ($\geq 1$) context-sensitive productions.


We also have the following property: every S-extended type 1 grammar is
equivalent to an S-extended context-sensitive grammar, that is, a context-sensitive
grammar with, possibly, the production $S \to \varepsilon$.

The following theorem relates the notions of grammars of Definition 1.2.1 with 
the notions of extended grammars and S-extended grammars. 


THEOREM 1.5.4. [Relationship Between S-extended Grammars and Extended
Grammars] (i) Every extended type 0 grammar is a type 0 grammar and
vice versa.

(ii) Every extended type 1 grammar is a type 0 grammar.

(iii) For every extended type 2 grammar G such that $\varepsilon \notin L(G)$, there exists an
equivalent type 2 grammar. For every extended type 2 grammar G such that $\varepsilon \in L(G)$,
there exists an equivalent, S-extended type 2 grammar.

(iv) For every extended type 3 grammar G such that $\varepsilon \notin L(G)$, there exists an
equivalent type 3 grammar. For every extended type 3 grammar G such that $\varepsilon \in L(G)$,
there exists an equivalent, S-extended type 3 grammar.


PROOF. Points (i) and (ii) follow directly from the definitions. Point (iii) will
be proved in Section 3.5.3 (see page 125). Point (iv) follows from Point (iii) and
Algorithm 3.5.8 on page 126. Indeed, according to that algorithm, every production
of the form: $A \to a$, where $A \in V_N$ and $a \in V_T$ is left unchanged, while every
production of the form: $A \to aB$, where $A \in V_N$, $a \in V_T$, and $B \in V_N$, either is left
unchanged or can generate a production of the form: $A \to a$, where $A \in V_N$ and
$a \in V_T$.


REMARK 1.5.5. The main reason for introducing the notions of the extended
grammars and the S-extended grammars is the correspondence between S-extended
type 3 grammars and finite automata which we will show in Chapter 2 (see
Theorem 2.1.14 on page 33 and Theorem 2.2.1 on page 33).


We have the following fact whose proof is immediate (see also Theorem 1.5.10). 


FACT 1.5.6. Let us consider a type 1 grammar G whose axiom is S. If we add
to the grammar G the n ($\geq 0$) epsilon productions $A_1 \to \varepsilon$, ..., $A_n \to \varepsilon$, such
that the nonterminal symbols $A_1, \ldots, A_n$ do not occur on the right hand side of
any production, then we get a grammar $G'$ which is an extended type 1
grammar such that:

(i) if $S \notin \{A_1, \ldots, A_n\}$ then $L(G) = L(G')$
(ii) if $S \in \{A_1, \ldots, A_n\}$ then $L(G) \cup \{\varepsilon\} = L(G')$.


As a consequence of this fact and of Theorem 1.5.4 above, in the sequel we will often
use the generalized notions of type 1, type 2, and type 3 grammars and languages
which we introduce in the following Definition 1.5.7. As stated by Fact 1.5.9 below,
these generalized definitions: (i) allow the empty word $\varepsilon$ to be an element of any
language L of type 1, or type 2, or type 3, and also (ii) ensure that the language
$L - \{\varepsilon\}$ is, respectively, of type 1, or type 2, or type 3, in the sense of the previous
Definition 1.3.1.

We hope that it will not be difficult for the reader to understand whether the
notion of grammar (or language) we consider in each sentence throughout the book,
is that of Definition 1.3.1 on page 13 or that of the following definition.


DEFINITION 1.5.7. [Type 1, Context-Sensitive, Type 2, and Type 3 Production,
Grammar, and Language. Version with Epsilon Productions]
(1) Given a grammar $G = \langle V_T, V_N, P, S \rangle$ we say that a production in P is of type 1
iff (1.1) either it is of the form $\alpha \to \beta$, where $\alpha \in (V_T \cup V_N)^+$, $\beta \in (V_T \cup V_N)^+$, and
$|\alpha| \leq |\beta|$, or it is $S \to \varepsilon$, and (1.2) if the production $S \to \varepsilon$ is in P then the axiom
S does not occur on the right hand side of any production in P.

(cs) Given a grammar $\langle V_T, V_N, P, S \rangle$, we say that a production in P is
context-sensitive iff (cs.1) either it is of the form $u A v \to u w v$, where $u, v \in V^*$, $A \in V_N$,
and $w \in (V_T \cup V_N)^+$, or it is $S \to \varepsilon$, and (cs.2) if the production $S \to \varepsilon$ is in P then
the axiom S does not occur on the right hand side of any production in P.

(2) Given a grammar $G = \langle V_T, V_N, P, S \rangle$ we say that a production in P is of type 2
(or context-free) iff it is of the form $\alpha \to \beta$, where $\alpha \in V_N$ and $\beta \in V^*$.

(3) Given a grammar $G = \langle V_T, V_N, P, S \rangle$ we say that a production in P is of type 3
(or regular) iff it is of the form $\alpha \to \beta$, where $\alpha \in V_N$ and $\beta \in \{\varepsilon\} \cup V_T \cup V_T V_N$.

A grammar is of type 1, context-sensitive, of type 2, and of type 3 iff all its
productions are of type 1, context-sensitive, of type 2, and of type 3, respectively.
A type 1, context-sensitive, type 2 (or context-free), and type 3 (or regular) language
is a language generated by a type 1, context-sensitive, type 2, and type 3 grammar,
respectively.

As a consequence of Theorem 4.0.3 on page 172 the notions of type 1 and
context-sensitive grammars are equivalent and thus, the notions of type 1 and
context-sensitive languages coincide. For this reason, instead of saying ‘a language of type 1’,
we will also say ‘a context-sensitive language’ and vice versa.

One can show (see Section 2.4 on page 39) that every type 3 (or regular)
language can be generated by left linear grammars, that is, grammars in which every
production is of the form $\alpha \to \beta$, where $\alpha \in V_N$ and $\beta \in \{\varepsilon\} \cup V_T \cup V_N V_T$.

REMARK 1.5.8. In the above definitions of type 2 and type 3 productions, we
do not require that the axiom S does not occur on the right hand side of any
production. Thus, it does not immediately follow from those definitions that also when
epsilon productions are allowed, the grammars of type 0, 1, 2, and 3 do constitute a
hierarchy, in the sense that, for i = 0, 1, 2, the class of type i languages properly
includes the class of type i+1 languages. However, as a consequence of Theorems 1.3.6
and 1.5.4, and Fact 1.5.6, it is the case that they do constitute a hierarchy.

Contrary to our Definition 1.5.7 above, in some textbooks (see, for instance, [9])
the production of the empty word ε is not allowed for type 1 grammars, while it is
allowed for type 2 and type 3 grammars, and thus, in that case the grammars of
type 0, 1, 2, and 3 do constitute a hierarchy if we do not consider the generation of
the empty word. □


We have the following fact which is a consequence of Theorems 1.3.6 and 1.5.4, 
and Fact 1.5.6. 


FACT 1.5.9. A language L is context-sensitive (or context-free, or regular) in
the sense of Definition 1.3.1 iff the language L ∪ {ε} is context-sensitive (or
context-free, or regular, respectively) in the sense of Definition 1.5.7.


We also have the following theorem. 


THEOREM 1.5.10. [Salomaa Theorem for Type 1 Grammars] For every
extended type 1 grammar G = (V_T, V_N, P, S) such that for every production of
the form A → ε, the nonterminal A does not occur on the right hand side of any
production, there exists an equivalent S-extended type 1 grammar G′ = (V_T, V_N ∪
{S′, S_1}, P′, S′), whose productions in P′ are of the form:

(i)   S′ → S′ A   (with A different from S′)
(ii)  AB → AC    (the left context is preserved)
(iii) AB → CB    (the right context is preserved)
(iv)  A → B
(v)   A → a
(vi)  S′ → ε

where A, B, and C are nonterminal symbols, a ∈ V_T, and the axiom S′ occurs on the
right hand side of productions of the form (i) only. The set P′ of productions
includes the production S′ → ε iff ε ∈ L(G′).
PROOF. Let us consider the grammar G = (V_T, V_N, P, S). Since for each
production A → ε, the nonterminal symbol A does not occur on the right hand side
of any production of the grammar G, the symbol A may occur in a sentential form
of a derivation of a word in L(G) starting from S, only if A is the axiom S and
that derivation is S → ε. Thus, by the Kuroda Theorem we can get a grammar G_1,
equivalent to G, whose axiom is S and whose productions are of the form:


(i)   A → BC
(ii)  AB → AC    (the left context A is preserved)
(iii) AB → CB    (the right context B is preserved)
(iv)  A → B
(v)   A → a
(vi)  S → ε


where A, B, and C are nonterminal symbols in V_N (thus, they may also be S)
and the production S → ε belongs to the set of productions of the grammar G_1
iff ε ∈ L(G_1). Now let us consider two new nonterminal symbols S′ and S_1 and
the grammar G_2 =def (V_T, V_N ∪ {S′, S_1}, P_2, S′) with axiom S′ and the set P_2 of
productions which consists of the following productions:


1.  S′ → S′ S_1
2.  S′ → S

and for each nonterminal symbol A of the grammar G_1, the productions:

3.  S_1 A → A S_1
4.  A S_1 → S_1 A

and for each production A → BC of the grammar G_1, the productions:

5.  A S_1 → BC

and the productions of the grammar G_1 of the form:

6.  AB → AC    (the left context A is preserved)
7.  AB → CB    (the right context B is preserved)
8.  A → B
9.  A → a

and the production:

10. S′ → ε    iff S → ε is a production of G_1.
Now we show that L(G_1) = L(G_2) by proving the following two properties.
Property (P1): for any w ∈ V_T*, if S′ →*_{G_2} w and w ∈ L(G_2) then S →*_{G_1} w.
Property (P2): for any w ∈ V_T*, if S →*_{G_1} w and w ∈ L(G_1) then S′ →*_{G_2} w.
Properties (P1) and (P2) are obvious if w = ε. For w ≠ ε we reason as follows.
Proof of Property (P1). The derivation of w from S in the grammar G_1 can be
obtained as a subderivation of the derivation of w from S′ in the grammar G_2 after
removing in each sentential form the nonterminal S_1.
Proof of Property (P2). If S →*_{G_1} w and w ∈ L(G_1) then S (S_1)^n →*_{G_2} w for some
n ≥ 0. Indeed, the productions S_1 A → A S_1 and A S_1 → S_1 A can be used in the
derivation of a word w using the grammar G_2, for inserting copies of the symbol
S_1 where they are required for applying the production A S_1 → BC to simulate the
effect of the production A → BC in the derivation of a word w ∈ L(G_1) using the
grammar G_1.

Since S′ →*_{G_2} S′ (S_1)^n →_{G_2} S (S_1)^n for all n ≥ 0, the proof of Property (P2) is
completed. This also concludes the proof that L(G_1) = L(G_2).

Now from the grammar G_2, we can get the desired grammar G′ with the productions
of the desired form, by replacing every production of the form AB → CD by
three productions of the form AB → AT, AT → CT, and CT → CD, where T is a
new nonterminal symbol. We leave it to the reader to prove that L(G_2) = L(G′). □
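
The last step of the proof can be sketched in code as follows. This is only an illustration under stated assumptions: productions are pairs of tuples of symbols, a production counts as being of the form AB → CD when neither the left nor the right context is preserved, and the fresh nonterminals are given the hypothetical names T1, T2, ... (the function name is also an assumption).

```python
def split_two_sided(productions, used_symbols, fresh_prefix="T"):
    """Replace each AB -> CD by AB -> AT, AT -> CT, CT -> CD with T fresh."""
    used = set(used_symbols)
    result = []
    counter = 0
    for lhs, rhs in productions:
        two_to_two = len(lhs) == 2 and len(rhs) == 2
        if two_to_two and lhs[0] != rhs[0] and lhs[1] != rhs[1]:
            # choose a nonterminal name which has not been used so far
            counter += 1
            while f"{fresh_prefix}{counter}" in used:
                counter += 1
            t = f"{fresh_prefix}{counter}"
            used.add(t)
            a, b = lhs
            c, d = rhs
            result += [((a, b), (a, t)),    # left context A preserved
                       ((a, t), (c, t)),    # right context T preserved
                       ((c, t), (c, d))]    # left context C preserved
        else:
            result.append((lhs, rhs))       # already of the desired form
    return result


example = [(("A", "S1"), ("B", "C")), (("A", "B"), ("A", "C"))]
rewritten = split_two_sided(example, {"A", "B", "C", "S1"})
```

Here the production A S1 → B C is replaced by three productions through the new symbol T1, while A B → A C is left unchanged because it already preserves its left context.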


Note that while in the Kuroda normal form (see Definition 1.3.9 on page 17) we
have, among others, some productions of the form A → BC, where A, B, and C
are nonterminal symbols, here in the Salomaa Theorem (see Theorem 1.5.10 on
page 22) the only form of production in which a single nonterminal symbol produces
two nonterminal symbols is of the form S′ → S′ A, where S′ is the axiom of the
grammar and A is different from S′. Thus, the Salomaa Theorem can be viewed
as an improvement with respect to the Kuroda Theorem (see Theorem 1.3.11 on
page 17).


1.6. Derivations in Context-Free Grammars 


For context-free grammars we can associate a derivation tree, also called a parse 
tree, with every derivation of a word w from the axiom S. 

Given a context-free grammar G = (V_T, V_N, P, S), and a derivation of a word w
from the axiom S, that is, a sequence α_1 → α_2 → ... → α_n of n (> 1) sentential
forms such that, in particular, α_1 = S and α_n = w (see Definition 1.2.7 on page 12),
the corresponding derivation tree T is constructed as indicated by the following two
rules.

Rule (1). The root of T is a node labeled by S. 
Rule (2). For any i = 1, ..., n−1, let us consider in the given derivation

    α_1 → α_2 → ... → α_n

the i-th derivation step α_i → α_{i+1}. Let us assume that in that derivation step we
have applied the production A → β, where:

(i) A ∈ V_N,

(ii) β = c_1 ... c_k, for some k ≥ 0, and

(iii) for j = 1, ..., k, c_j ∈ V_N ∪ V_T.

In the derivation tree constructed so far, we consider the leaf-node labeled by the
symbol A which is replaced by β in that derivation step.

If k ≥ 1 then we generate k son-nodes of that leaf-node and they will be labeled,
from left to right, by c_1, ..., c_k, respectively. (Obviously, after the generation of
these k son-nodes, the leaf-node labeled by A will no longer be a leaf-node and will
become an internal node of the new derivation tree.)

If k = 0 then we generate one son-node of the node labeled by A. The label of
that new node will be the empty word ε.

When all the derivation steps of the given derivation α_1 → α_2 → ... → α_n have
been considered, the left-to-right concatenation of the labels of all the leaves of the
resulting derivation tree T is the word w.

The word w is said to be the yield of the derivation tree T. 


EXAMPLE 1.6.1. Let us consider the grammar whose productions are:

S → aAS
S → a
A → SbA
A → ba
A → SS

with axiom S. Let us also consider the following derivation:

D:  S → aAS → aSbAS → aabAS → aabbaS → aabbaa
   (1)  (2)   (3)       (4)       (5)

where, in each sentential form α_i, the number placed below it marks the nonterminal
symbol which is replaced in the derivation step α_i → α_{i+1}. The corresponding
derivation tree is depicted in Figure 1.6.1 on page 25. In the above derivation D
these numbers denote the correspondence between the derivation steps and the nodes
with the same number in the derivation tree depicted in Figure 1.6.1.


                    S (1)
                 /    |     \
               a    A (2)      S (5)
                  /   |   \      |
             S (3)    b   A (4)  a
               |          /   \
               a         b     a


FIGURE 1.6.1. A derivation tree for the word aabbaa and the grammar
given in Example 1.6.1 on page 25. This tree corresponds to the
derivation D: S → aAS → aSbAS → aabAS → aabbaS → aabbaa.
The numbers associated with the nonterminal symbols denote the
correspondence between the nonterminal symbols and the derivation
steps of the derivation D on page 25.
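
Rules (1) and (2) can be sketched as a small program. The sketch below assumes that every symbol is a single character and that a derivation is given as a list of pairs (position, right hand side), where position is the index, in the current sentential form, of the nonterminal being replaced; the names Node, build_tree, and tree_yield are illustrative assumptions.

```python
class Node:
    def __init__(self, label):
        self.label = label        # a grammar symbol, or "" for the empty word
        self.children = []


def build_tree(axiom, steps):
    """Build the derivation tree by applying Rule (2) to each derivation step."""
    root = Node(axiom)            # Rule (1): the root is labeled by the axiom
    frontier = [root]             # leaves which spell the current sentential form
    for position, rhs in steps:
        node = frontier[position]
        if rhs:                   # k >= 1: one son-node for each symbol of rhs
            node.children = [Node(c) for c in rhs]
            frontier[position:position + 1] = node.children
        else:                     # k = 0: a single son-node labeled by the empty word
            node.children = [Node("")]
            del frontier[position]
    return root


def tree_yield(node):
    """Left-to-right concatenation of the labels of the leaves."""
    if not node.children:
        return node.label
    return "".join(tree_yield(c) for c in node.children)


# The derivation D of Example 1.6.1:
# S -> aAS -> aSbAS -> aabAS -> aabbaS -> aabbaa
D = [(0, "aAS"), (1, "SbA"), (1, "a"), (3, "ba"), (5, "a")]
tree = build_tree("S", D)
```

For the derivation D the yield of the resulting tree is the word aabbaa, and the sons of the root are labeled a, A, and S, as in Figure 1.6.1.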


Given a word w and a derivation α_1 → α_2 → ... → α_n, with n > 1, where α_1 = S
and α_n = w, for a context-free grammar, we say that it is a leftmost derivation of w
from S iff for i = 1, ..., n−1, in the derivation step α_i → α_{i+1} the nonterminal
symbol which is replaced in the sentential form α_i is the leftmost nonterminal in α_i.
A derivation step α_i → α_{i+1} in which we replace the leftmost nonterminal in α_i is
also denoted by α_i →_lm α_{i+1}. The derivation D in the above Example 1.6.1 is a
leftmost derivation.

Similarly to the notion of leftmost derivation, there is also the notion of rightmost
derivation where at each derivation step the rightmost nonterminal symbol is
replaced. A rightmost derivation step is usually denoted by →_rm.


THEOREM 1.6.2. Given a context-free grammar G, for every word w ∈ L(G)
there exists a leftmost derivation of w and a rightmost derivation of w.


PROOF. The proof is by structural induction on the derivation tree of w. □


EXAMPLE 1.6.3. Let us consider the grammar whose productions are:

E → E + T
E → T
T → T × F
T → F
F → (E)
F → a

with axiom E. Let us also consider the following three derivations D1, D2, and D3,
where the subscript of each arrow indicates whether in that derivation step the
leftmost (lm) or the rightmost (rm) nonterminal symbol of the sentential form is
replaced:

D1: E →_lm E+T →_lm T+T →_lm F+T →_lm a+T →_lm a+F →_lm a+a
D2: E →_rm E+T →_rm E+F →_rm E+a →_rm T+a →_rm F+a →_rm a+a
D3: E →_lm E+T →_rm E+F →_lm T+F →_rm T+a →_lm F+a →_lm a+a

We have that: (i) derivation D1 is leftmost, (ii) derivation D2 is rightmost, and
(iii) derivation D3 is neither rightmost nor leftmost.
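
The three claims of Example 1.6.3 can be checked with a short program. This is a sketch under the assumptions that every symbol is a single character and that a derivation is given as the list of its sentential forms; the function names are illustrative.

```python
PRODUCTIONS = {"E": ["E+T", "T"], "T": ["T×F", "F"], "F": ["(E)", "a"]}


def rewrites_at(form, next_form, position, productions):
    """True iff next_form is obtained by rewriting the symbol at position."""
    return any(form[:position] + rhs + form[position + 1:] == next_form
               for rhs in productions[form[position]])


def follows_strategy(forms, productions, pick):
    """pick selects one index from the list of the nonterminal positions."""
    for form, next_form in zip(forms, forms[1:]):
        positions = [i for i, s in enumerate(form) if s in productions]
        if not positions:
            return False
        if not rewrites_at(form, next_form, pick(positions), productions):
            return False
    return True


def is_leftmost(forms, productions=PRODUCTIONS):
    return follows_strategy(forms, productions, lambda ps: ps[0])


def is_rightmost(forms, productions=PRODUCTIONS):
    return follows_strategy(forms, productions, lambda ps: ps[-1])


D1 = ["E", "E+T", "T+T", "F+T", "a+T", "a+F", "a+a"]
D2 = ["E", "E+T", "E+F", "E+a", "T+a", "F+a", "a+a"]
D3 = ["E", "E+T", "E+F", "T+F", "T+a", "F+a", "a+a"]
```

On these lists is_leftmost holds for D1 only and is_rightmost holds for D2 only, so D3 is neither leftmost nor rightmost.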


Let us also introduce the following definition which we will need later. 


DEFINITION 1.6.4. [Unfold and Fold of a Context-Free Production] Let
us consider a context-free grammar G = (V_T, V_N, P, S). Let A, B be elements of
V_N and α, β_1, ..., β_n, γ be elements of (V_T ∪ V_N)*. Let A → α B γ be a production
in P, and B → β_1 | ... | β_n be all the productions in P whose left hand side is B.

The unfolding of B in A → α B γ with respect to P (or simply, the unfolding of
B in A → α B γ) is the replacement of

the production: A → α B γ   by the productions: A → α β_1 γ | ... | α β_n γ.

Conversely, let A → α β_1 γ | ... | α β_n γ be some productions in P whose left hand
side is A, and B → β_1 | ... | β_n be all the productions in P whose left hand side
is B.

The folding of β_1, ..., β_n in A → α β_1 γ | ... | α β_n γ with respect to P (or simply,
the folding of β_1, ..., β_n in A → α β_1 γ | ... | α β_n γ) is the replacement of

the productions: A → α β_1 γ | ... | α β_n γ   by the production: A → α B γ.


Sometimes, instead of saying ‘unfolding of B in A → α B γ with respect to P’, we
will feel free to say ‘unfolding of B in A → α B γ by using P’.
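
The two operations can be illustrated by the following sketch, which assumes that the set of productions is a dictionary from each nonterminal to the list of its right hand sides, with single-character symbols. The function names and the argument conventions are assumptions made for the example, and the grammar is the one of Example 1.6.1.

```python
def unfold(productions, a, rhs, position):
    """Unfold the nonterminal at index position in the production a -> rhs."""
    alpha, b, gamma = rhs[:position], rhs[position], rhs[position + 1:]
    new_rhss = [alpha + beta + gamma for beta in productions[b]]
    result = {x: list(rs) for x, rs in productions.items()}
    i = result[a].index(rhs)
    result[a][i:i + 1] = new_rhss        # a -> alpha beta_1 gamma | ... | alpha beta_n gamma
    return result


def fold(productions, a, alpha, b, gamma):
    """Fold a -> alpha beta_1 gamma | ... | alpha beta_n gamma into a -> alpha b gamma."""
    targets = [alpha + beta + gamma for beta in productions[b]]
    if not all(t in productions[a] for t in targets):
        raise ValueError("some production to be folded is missing")
    result = {x: list(rs) for x, rs in productions.items()}
    result[a] = [r for r in result[a] if r not in targets] + [alpha + b + gamma]
    return result


P = {"S": ["aAS", "a"], "A": ["SbA", "ba", "SS"]}
U = unfold(P, "S", "aAS", 1)             # unfolding of A in S -> aAS
F = fold(U, "S", "a", "A", "S")          # folding back into S -> aAS
```

Unfolding A in S → aAS gives the productions S → aSbAS | abaS | aSSS | a, and the subsequent folding restores the productions S → aAS | a (possibly in a different order).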
DEFINITION 1.6.5. [Left Recursive Context-Free Production and Left
Recursive Context-Free Grammar] Let us consider a context-free grammar G =
(V_T, V_N, P, S). We say that a production in P is left recursive if it is of the form
A → A α with A ∈ V_N and α ∈ V*. A context-free grammar is said to be left
recursive if one of its productions is left recursive.


The reader should not confuse this notion of left recursive context-free grammar with
the one of Definition 3.5.19 on page 130.


1.7. Substitutions and Homomorphisms 


In this section we introduce some notions which will be useful in the sequel for 
stating various closure properties of some classes of languages we will consider. 


DEFINITION 1.7.1. [Substitution] Given two alphabets Σ and Ω, a substitution
is a mapping which takes a symbol of Σ, and returns a language subset of Ω*.


Any substitution σ_0 with domain Σ can be canonically extended to a mapping
σ_1, also called a substitution, which takes a word in Σ* and returns a language subset
of Ω*, as follows:

(1) σ_1(ε) = {ε}
(2) σ_1(wa) = σ_1(w) · σ_0(a)    for any w ∈ Σ* and a ∈ Σ

where the operation · denotes the concatenation of languages. (Recall that for
every symbol a ∈ Σ the value of σ_0(a) is a language subset of Ω*, and also for every
word w ∈ Σ* the value of σ_1(w) is a language subset of Ω*.) Since concatenation
of languages is associative, Equation (2) above can be replaced by the following one:

(2*) σ_1(a_1 ... a_n) = σ_0(a_1) · ... · σ_0(a_n)    for any n > 0

Any substitution σ_1 with domain Σ* can be canonically extended to a mapping σ_2,
also called a substitution, which takes a language subset of Σ* and returns a language
subset of Ω*, as follows: for any L ⊆ Σ*,

σ_2(L) = ⋃_{w ∈ L} σ_1(w)
       = {z | z ∈ σ_0(a_1) · ... · σ_0(a_n) for some word a_1 ... a_n ∈ L}

Since substitutions have canonical extensions and also these extensions are called
substitutions, in order to avoid ambiguity, when we introduce a substitution we have
to indicate its domain and its codomain. However, we will not do so when confusion
does not arise.
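
For finite languages the two canonical extensions can be computed directly. The sketch below assumes that σ_0 is given as a dictionary from symbols to finite sets of words and that words are Python strings; the names concat, sigma1, and sigma2, as well as the sample substitution, are illustrative assumptions.

```python
def concat(l1, l2):
    """Concatenation of two finite languages."""
    return {u + v for u in l1 for v in l2}


def sigma1(sigma0, word):
    """Extension to words, by Equations (1) and (2)."""
    result = {""}                         # sigma1(eps) = {eps}
    for a in word:
        result = concat(result, sigma0[a])
    return result


def sigma2(sigma0, language):
    """Extension to languages: the union of sigma1(w) for w in the language."""
    return set().union(*(sigma1(sigma0, w) for w in language))


sigma0 = {"a": {"0", "00"}, "b": {"1"}}   # a hypothetical substitution
image = sigma2(sigma0, {"a", "ab"})
```

With this choice of σ_0 we get σ_1(ab) = {01, 001} and σ_2({a, ab}) = {0, 00, 01, 001}.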

DEFINITION 1.7.2. [Homomorphism and ε-free Homomorphism] Given
two alphabets Σ and Ω, a homomorphism is a total function which maps every
symbol in Σ to a word w ∈ Ω*. A homomorphism h is said to be ε-free iff for every
a ∈ Σ, h(a) ≠ ε.

Note that sometimes in the literature (see, for instance, [9, pages 60 and 61]),
given two alphabets Σ and Ω, a homomorphism is defined as a substitution which
maps every symbol in Σ to a language L ∈ Powerset(Ω*) with exactly one word.
This definition of a homomorphism is equivalent to ours because when dealing with
homomorphisms, one can assume that for any given word w ∈ Ω*, the singleton
language {w} ∈ Powerset(Ω*) is identified with the word w itself.

As for substitutions, a homomorphism h from Σ to Ω* can be canonically
extended to a function, also called a homomorphism and denoted h, from Powerset(Σ*)
to Powerset(Ω*). Thus, given any language L ⊆ Σ*, the homomorphic image under
h of L is the language h(L) which is a subset of Ω*.


EXAMPLE 1.7.3. Given Σ = {a, b} and Ω = {0, 1}, let us consider the
homomorphism h : Σ → Ω* such that

h(a) = 0101
h(b) = 01

We have that h({b, ab, ba, bbb}) = {01, 010101}.
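
The homomorphic image of a finite language can be computed symbol by symbol. The following sketch assumes that h is given as a dictionary from symbols to words; the names h_word and h_language are illustrative assumptions.

```python
def h_word(h, word):
    """Image of a word: the concatenation of the images of its symbols."""
    return "".join(h[a] for a in word)


def h_language(h, language):
    """Homomorphic image of a finite language."""
    return {h_word(h, w) for w in language}


h = {"a": "0101", "b": "01"}              # the homomorphism of Example 1.7.3
image = h_language(h, {"b", "ab", "ba", "bbb"})
```

The three words ab, ba, and bbb all have the same image 010101, which is why the image of a language with four words has two words only.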


DEFINITION 1.7.4. [Inverse Homomorphism and Inverse Homomorphic
Image] Given a homomorphism h from Σ to Ω* and a language V, subset of Ω*,
the inverse homomorphic image of V under h, denoted h^{-1}(V), is the following
language, subset of Σ*:  h^{-1}(V) = {x | x ∈ Σ* and h(x) ∈ V}.

Given a language V, the inverse h^{-1} of an ε-free homomorphism h returns a new
language L by replacing every word v of V by either zero or one or more words, each
of which is not longer than v.


EXAMPLE 1.7.5. Let us consider the homomorphism h of Example 1.7.3 on
page 28. We have that

h^{-1}({010101}) = {ab, ba, bbb}
h^{-1}({0101, 010, 10}) = {a, bb}
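
For an ε-free homomorphism and a finite language V, the inverse homomorphic image is finite and can be found by trying, at each point, every symbol whose image is a prefix of what remains of the word. The sketch below relies on these two assumptions; the names inverse_word and inverse_language are illustrative.

```python
def inverse_word(h, v):
    """All words x such that h(x) = v, for an eps-free homomorphism h."""
    if any(image == "" for image in h.values()):
        raise ValueError("h is assumed to be eps-free")
    if v == "":
        return {""}
    result = set()
    for a, image in h.items():
        if v.startswith(image):           # x may begin with the symbol a
            rest = inverse_word(h, v[len(image):])
            result |= {a + x for x in rest}
    return result


def inverse_language(h, language):
    """Inverse homomorphic image of a finite language."""
    return set().union(*(inverse_word(h, v) for v in language))


h = {"a": "0101", "b": "01"}              # the homomorphism of Example 1.7.3
first = inverse_language(h, {"010101"})
second = inverse_language(h, {"0101", "010", "10"})
```

The words 010 and 10 have no inverse image under h, so they do not contribute to the second result.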


Given two alphabets Σ and Ω, a language L ⊆ Σ*, and a homomorphism h which
maps L into a language subset of Ω*, we have that:

(i) L ⊆ h^{-1}(h(L))    and    (ii) h(h^{-1}(L)) ⊆ L


Note that these Properties (i) and (ii) actually hold for any function, not necessarily
a homomorphism, which maps a language subset of Σ* into a language subset of Ω*.


DEFINITION 1.7.6. [Inverse Homomorphic Image of a Word] Given a
homomorphism h from Σ to Ω*, and a word w of Ω*, we define the inverse homomorphic
image of w under h, denoted h^{-1}(w), to be the following language subset of Σ*:

h^{-1}(w) = {x | x ∈ Σ* and h(x) = w}.


EXAMPLE 1.7.7. Let us consider the homomorphism h of Example 1.7.3 on
page 28. We have that h^{-1}(0101) = {a, bb}.


We end this section by introducing the notion of a closure of a class of languages 
under a given operation. 


DEFINITION 1.7.8. [Closure of a Class of Languages] Given a class C of
languages, we say that C is closed under a given operation f of arity n iff f applied 
to n languages in C returns a language in C. 


This closure notion will be used in the sequel and, in particular, in Sections 2.12, 
3.13, 3.17, and 7.5, starting on page 94, 157, 169, and 224, respectively. 


CHAPTER 2 


Finite Automata and Regular Grammars 


In this chapter we will introduce the deterministic finite automata and the nondeter- 
ministic finite automata and we will show their equivalence (see Theorem 2.1.14 on 
page 33). We will also prove the equivalence between deterministic finite automata 
and S-extended type 3 grammars. We will introduce the notion of regular expres- 
sions (see Section 2.5) and we will prove the equivalence between regular expressions 
and deterministic finite automata. We will also study the problem of minimizing 
the number of states of the finite automata and we will present a parser for type 3 
languages. Finally, we will introduce some generalizations of the finite automata and 
we will consider various closure and decidability properties for type 3 languages. 


2.1. Deterministic and Nondeterministic Finite Automata 
The following definition introduces the notion of a deterministic finite automaton. 


DEFINITION 2.1.1. [Deterministic Finite Automaton] A deterministic finite
automaton (also called finite automaton, for short) over the finite alphabet Σ (also
called the input alphabet) is a quintuple (Q, Σ, q_0, F, δ) where:

- Q is a finite set of states,

- q_0 is an element of Q, called the initial state,

- F ⊆ Q is the set of final states, and

- δ is a total function, called the transition function, from Q × Σ to Q.


A finite automaton is usually depicted as a labeled multigraph whose nodes are the
states and whose edges represent the transition function as follows: for every state
q_1 and q_2 and every symbol v in Σ, if δ(q_1, v) = q_2 then there is an edge from node
q_1 to node q_2 with label v.

If we have that δ(q_1, v_1) = q_2 and ... and δ(q_1, v_n) = q_2, for some n > 1, we will
feel free to depict only one edge from node q_1 to node q_2, and that edge will have
the n labels v_1, ..., v_n, separated by commas (see, for instance, Figure 2.1.2 (β) on
page 32).

Usually the initial state is depicted as a node with an incoming arrow and the 
final states are depicted as nodes with two circles (see, for instance, Figure 2.1.1 on 
page 31). We have to depict a finite automaton using a multigraph, rather than a 
graph, because between any two nodes there can be, in general, more than one edge. 


Let δ* be the total function from Q × Σ* to Q defined as follows:

(i) for every q ∈ Q, δ*(q, ε) = q, and
(ii) for every q ∈ Q, for every word wv with w ∈ Σ* and v ∈ Σ,
     δ*(q, wv) = δ(δ*(q, w), v).

For every w_1, w_2 ∈ Σ* we have that δ*(q, w_1 w_2) = δ*(δ*(q, w_1), w_2).

Given a finite automaton, we say that there is a w-path from state q_1 to state q_2
for some word w ∈ Σ* iff δ*(q_1, w) = q_2.

When the transition function δ is applied, we say that the finite automaton
makes a move (or a transition). In that case we also say that a state transition
takes place.


REMARK 2.1.2. [Epsilon Moves] Note that since in each move one symbol of
the input is given as an argument to the transition function δ, we say that a finite
automaton is not allowed to make ε-moves (see the related notion of an ε-move for
a pushdown automaton introduced in Definition 3.1.5 on page 101).


DEFINITION 2.1.3. [Equivalence Between States of Finite Automata]
Given a finite automaton (Q, Σ, q_0, F, δ) we say that a state q_1 ∈ Q is equivalent to
a state q_2 ∈ Q iff for every word w ∈ Σ* we have that δ*(q_1, w) ∈ F iff δ*(q_2, w) ∈ F.


As a consequence of this definition, given a finite automaton (Q, Σ, q_0, F, δ), if a
state q_1 is equivalent to a state q_2 then for every v ∈ Σ, the state δ(q_1, v) is equivalent
to the state δ(q_2, v). (Note that this statement is not an ‘iff’.)


DEFINITION 2.1.4. [Language Accepted by a Finite Automaton] We say
that a finite automaton (Q, Σ, q_0, F, δ) accepts a word w in Σ* iff δ*(q_0, w) ∈ F. A
finite automaton accepts a language L iff it accepts every word in L and no other
word. If a finite automaton M accepts a language L, we say that L is the language
accepted by M. L(M) denotes the language accepted by the finite automaton M.


When introducing the concepts of this definition, other textbooks use the terms
‘recognizes’ and ‘recognized’, instead of the terms ‘accepts’ and ‘accepted’, respectively.
The set of languages accepted by the set of the finite automata over Σ is denoted
L_{FA,Σ} or simply L_FA, when Σ is understood from the context. We will prove that
L_{FA,Σ} is equal to REG, that is, the class of all regular languages subsets of Σ*.


DEFINITION 2.1.5. [Equivalence Between Finite Automata] Two finite
tomata are said to be equivalent iff they accept the same language. 


A finite automaton can be given by providing: (i) its transition function δ, (ii) its
initial state q_0, and (iii) its final states F. Indeed, from the transition function, we
can derive the input alphabet Σ and the set of states Q.


EXAMPLE 2.1.6. In the following Figure 2.1.1 we have depicted a finite
automaton which accepts the empty string ε and the binary numerals denoting the natural
numbers that are divisible by 3. The numerals are given in input to the finite
automaton, starting from the most significant bit and ending with the least
significant bit. Thus, for instance, if we want to give in input to a finite automaton the
number 2^{n−1} b_1 + 2^{n−2} b_2 + ... + 2^1 b_{n−1} + 2^0 b_n, we have to give in input the string
b_1 b_2 ... b_{n−1} b_n of bits in the left-to-right order.

Starting from the initial state 0, the finite automaton will be in state 0 if the input
examined so far is the empty string ε and it will be in state x, with x ∈ {0, 1, 2},
if the input examined so far is the string b_1 b_2 ... b_j, for some j = 1, ..., n, which
denotes the integer k and k divided by 3 gives the remainder x, that is, there exists
an integer m such that k = 3m + x.

The correctness of the finite automaton depicted in Figure 2.1.1 is proved as
follows. The set of states is {0, 1, 2}, because the remainder of a division by 3 can
only be either 0 or 1 or 2. From state 0 to state 1 there is an arc labeled 1 because
if the string b_1 b_2 ... b_j of bits denotes an integer k divisible by 3 (and thus, there is
a (b_1 b_2 ... b_j)-path which leads from the initial state 0 again to state 0), then the
extended string b_1 b_2 ... b_j 1 denotes the integer 2k + 1, and thus, when we divide
2k + 1 by 3 we get the integer remainder 1. Analogously, one can prove that the labels
of all other arcs of the finite automaton depicted in Figure 2.1.1 are correct.


FIGURE 2.1.1. A deterministic finite automaton which accepts the
empty string ε and the binary numerals denoting natural numbers
divisible by 3 (see Example 2.1.6). For instance, this automaton
accepts the binary numeral 10010 which denotes the number 18, because
10010 leads from the initial state 0 again to state 0 (which is also a
final state) through the following sequence of states: 1, 2, 1, 0, 0.
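
The automaton of Example 2.1.6 and the function δ* can be sketched as follows. The transition function is written here in closed form: reading the bit b after a prefix denoting the integer k yields the integer 2k + b, so from state x the automaton moves to state (2x + b) mod 3. The function names are illustrative assumptions.

```python
def delta(state, bit):
    """One move: the new remainder after reading one more bit."""
    return (2 * state + int(bit)) % 3


def delta_star(state, word):
    """Extension of delta to words, following clauses (i) and (ii)."""
    for symbol in word:
        state = delta(state, symbol)
    return state


def trace(word, state=0):
    """States visited after each input symbol, starting from the given state."""
    visited = []
    for symbol in word:
        state = delta(state, symbol)
        visited.append(state)
    return visited


def accepts(word):
    """The initial state is 0 and the only final state is 0."""
    return delta_star(0, word) == 0


states_for_18 = trace("10010")
```

The call trace("10010") returns the sequence of states 1, 2, 1, 0, 0 mentioned in the caption of Figure 2.1.1, and the sketch can be used to check the property δ*(q, w_1 w_2) = δ*(δ*(q, w_1), w_2) on sample words.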


REMARK 2.1.7. Finite automata can also be introduced by stipulating that the
transition function is a partial function from Q × Σ to Q, rather than a total function
from Q × Σ to Q. If we do so, we get an equivalent notion of finite automata. Indeed,
one can show that for every finite automaton with a partial transition function, there
exists a finite automaton with a total transition function which accepts the same
language, and vice versa.

We will not formally prove this statement and, instead, we will provide the
following example which illustrates the proof technique. This technique uses a so
called sink state for constructing an equivalent finite automaton with a total
transition function, starting from a given finite automaton with a partial transition
function. □


EXAMPLE 2.1.8. Let us consider the finite automaton ({S, A}, {0, 1}, S, {S, A},
δ), where δ is the following partial transition function:

δ(S, 0) = S    δ(S, 1) = A    δ(A, 0) = S.

This automaton is depicted in Figure 2.1.2 (α). In order to get the equivalent finite
automaton with a total transition function we consider the additional state q_s, called
the sink state, and we stipulate that (see Figure 2.1.2 (β)):

δ(A, 1) = q_s    δ(q_s, 0) = q_s    δ(q_s, 1) = q_s.


FIGURE 2.1.2. (α) The deterministic finite automaton of
Example 2.1.8 with a partial transition function. (β) A deterministic finite
automaton equivalent to the finite automaton in (α). This second
automaton has the sink state q_s and a total transition function.
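
The sink state technique can be sketched as follows, assuming that a partial transition function is a dictionary from pairs (state, symbol) to states and that the name chosen for the sink state is not already a state; the function names complete and accepts are illustrative assumptions.

```python
def complete(states, alphabet, delta, sink="q_s"):
    """Return the set of states and a total transition function."""
    if sink in states:
        raise ValueError("the sink state must be a new state")
    if all((q, v) in delta for q in states for v in alphabet):
        return set(states), dict(delta)      # already total: nothing to add
    total = dict(delta)
    for q in set(states) | {sink}:
        for v in alphabet:
            total.setdefault((q, v), sink)   # every undefined move goes to the sink
    return set(states) | {sink}, total


def accepts(delta, initial, finals, word):
    """Acceptance for a (possibly partial) deterministic transition function."""
    state = initial
    for v in word:
        if (state, v) not in delta:
            return False                     # undefined move: the word is rejected
        state = delta[(state, v)]
    return state in finals


# The automaton of Example 2.1.8.
partial = {("S", "0"): "S", ("S", "1"): "A", ("A", "0"): "S"}
new_states, total = complete({"S", "A"}, {"0", "1"}, partial)
```

Since the sink state is not final and cannot be left, the two automata accept the same words: exactly those in which no two consecutive 1's occur.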


DEFINITION 2.1.9. [Nondeterministic Finite Automaton] A nondeterministic
finite automaton is like a finite automaton, with the only difference that the
transition function δ is a total function from Q × Σ to 2^Q, that is, from Q × Σ to
the set of the finite subsets of Q. Thus, the transition function δ returns a subset
of states, rather than a single state.


REMARK 2.1.10. According to Definitions 2.1.1 and 2.1.9, when we say ‘finite 
automaton’ without any other qualification, we actually mean a ‘deterministic finite 
automaton’. 


Similarly to a deterministic finite automaton, a nondeterministic finite automaton
is depicted as a labeled multigraph. In this multigraph for every state q_1 and q_2
and every symbol v in Σ, if q_2 ∈ δ(q_1, v) then there is an edge from node q_1 to node
q_2 with label v. The fact that the finite automaton is nondeterministic implies that
there may be more than one edge with the same label going out of a given node.


Obviously, every deterministic finite automaton can be viewed as a particular
nondeterministic finite automaton whose transition function δ returns singletons
only.

Let δ* be the total function from 2^Q × Σ* to 2^Q defined as follows:

(i) for every A ⊆ Q, δ*(A, ε) = A, and
(ii) for every A ⊆ Q, for every word wv, with w ∈ Σ* and v ∈ Σ,
     δ*(A, wv) = ⋃_{q ∈ δ*(A, w)} δ(q, v).

Given a nondeterministic finite automaton, we say that there is a w-path from
state q_1 to state q_2 for some word w ∈ Σ* iff q_2 ∈ δ*({q_1}, w).


DEFINITION 2.1.11. [Language Accepted by a Nondeterministic Finite
Automaton] A nondeterministic finite automaton (Q, Σ, q_0, F, δ) accepts a word w
in Σ* iff there exists a state in δ*({q_0}, w) which belongs to F. A nondeterministic
finite automaton accepts a language L iff it accepts every word in L and no other
word. If a nondeterministic finite automaton M accepts a language L, we say that
L is the language accepted by M.
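
The function δ* on sets of states and the acceptance condition can be sketched as follows. The transition function is assumed to be a dictionary from pairs (state, symbol) to sets of states, where a missing pair stands for the empty set. The sample automaton, which accepts the words over {0, 1} ending with 01, is a hypothetical one chosen for illustration, and so are the function names.

```python
def nfa_delta_star(delta, states, word):
    """delta*(A, w): the states reachable from a state of A through a w-path."""
    current = set(states)                     # clause (i): delta*(A, eps) = A
    for v in word:                            # clause (ii), one symbol at a time
        current = set().union(*(delta.get((q, v), set()) for q in current))
    return current


def nfa_accepts(delta, initial, finals, word):
    """True iff some state of delta*({q_0}, w) is a final state."""
    return bool(nfa_delta_star(delta, {initial}, word) & set(finals))


# Hypothetical automaton with states p, q, r, initial state p, final state r.
delta = {("p", "0"): {"p", "q"}, ("p", "1"): {"p"}, ("q", "1"): {"r"}}
reached = nfa_delta_star(delta, {"p"}, "01")
```

For this automaton the set of states reached from p by the word 01 is {p, r}, which contains the final state r, so the word 01 is accepted.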


When introducing the concepts of this definition, other textbooks use the terms ‘rec- 
ognizes’ and ‘recognized’, instead of the terms ‘accepts’ and ‘accepted’, respectively. 


DEFINITION 2.1.12. [Equivalence Between Nondeterministic Finite Auto- 
mata] Two nondeterministic finite automata are said to be equivalent iff they accept 
the same language. 


REMARK 2.1.13. As for deterministic finite automata, one may assume that the
transition functions of the nondeterministic finite automata are partial functions,
rather than total functions. Indeed, by using the sink state technique one can show
that for every nondeterministic finite automaton with a partial transition function,
there exists a nondeterministic finite automaton with a total transition function
which accepts the same language, and vice versa. □


We have the following theorem. 


THEOREM 2.1.14. [Rabin-Scott. Equivalence of Deterministic and
Nondeterministic Finite Automata] For every nondeterministic finite automaton
(Q, Σ, q_0, F, δ) there exists an equivalent, deterministic finite automaton whose set
of states is a subset of 2^Q.


This theorem will be proved in Section 2.3. 


2.2. Nondeterministic Finite Automata and S-extended Type 3 
Grammars 


In this section we establish a correspondence between the set of the S-extended
type 3 grammars whose set of terminal symbols is Σ and the set of the
nondeterministic finite automata over Σ.


THEOREM 2.2.1. [Equivalence Between S-extended Type 3 Grammars
and Nondeterministic Finite Automata] (i) For every S-extended type 3
grammar which generates the language L ⊆ Σ*, there exists a nondeterministic finite
automaton over Σ which accepts L and (ii) vice versa.


PROOF. Let us show Point (i). Given the S-extended type 3 grammar (V_T, V_N, P, S)
we construct the nondeterministic finite automaton (Q, V_T, S, F, δ) as indicated by
the following procedure. Note that S ∈ Q is the initial state of the nondeterministic
finite automaton.
ALGORITHM 2.2.2.
Procedure: from S-extended Type 3 Grammars to Nondeterministic Finite Automata.

Q := V_N;  F := ∅;  δ := ∅;
for every production p in P:
begin
  if p is A → aB then update δ by adding B to the set δ(A, a);
  if p is A → a  then begin introduce a new final state q;
                            Q := Q ∪ {q};  F := F ∪ {q};
                            update δ by adding q to the set δ(A, a) end;
  if p is S → ε  then F := F ∪ {S};
end


We leave it to the reader to show that the language generated by the S-extended
type 3 grammar (V_T, V_N, P, S) is equal to the language accepted by the automaton
(Q, V_T, S, F, δ).
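
Algorithm 2.2.2 can be transcribed into a short program. The sketch below assumes single-character symbols, productions given as pairs (A, right hand side) with the right hand side being the empty string, a, or aB, and new final states named Z1, Z2, ... (hypothetical names which are assumed not to clash with the nonterminals); the function name is also an assumption.

```python
def grammar_to_nfa(nonterminals, productions, axiom):
    """Return (Q, F, delta) for the automaton whose initial state is the axiom."""
    states, finals, delta = set(nonterminals), set(), {}
    fresh = 0
    for a, rhs in productions:
        if len(rhs) == 2:                          # p is A -> aB
            delta.setdefault((a, rhs[0]), set()).add(rhs[1])
        elif len(rhs) == 1:                        # p is A -> a
            fresh += 1
            q = f"Z{fresh}"                        # a new final state
            states.add(q)
            finals.add(q)
            delta.setdefault((a, rhs), set()).add(q)
        elif rhs == "" and a == axiom:             # p is S -> eps
            finals.add(axiom)
        else:
            raise ValueError(f"not an S-extended type 3 production: {a} -> {rhs}")
    return states, finals, delta


# The grammar of Example 2.2.4: S -> 0B, B -> 0B | 1S | 0.
P = [("S", "0B"), ("B", "0B"), ("B", "1S"), ("B", "0")]
Q, F, delta = grammar_to_nfa({"S", "B"}, P, "S")
```

For the grammar of Example 2.2.4 the sketch returns the automaton described in that example, with the new final state named Z1 instead of Z.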

Let us show Point (ii). Given a nondeterministic finite automaton (Q, Σ, q_0, F, δ)
we define the S-extended type 3 grammar (Σ, Q, P, q_0), where q_0 is the axiom and
the set P of productions is constructed as indicated by the following procedure.


ALGORITHM 2.2.3.
Procedure: from Nondeterministic Finite Automata to S-extended Type 3 Grammars.

P := ∅;
for every state A and B and for every symbol a such that B ∈ δ(A, a):
begin add to P the production A → aB;
      if B ∈ F then add to P the production A → a
end;
if q_0 ∈ F then add to P the production q_0 → ε


In the for-loop of this procedure, one looks at every state, one at a time, and 
for each state at every outgoing edge. The S-extended regular grammar which is 
generated by this procedure can then be simplified by eliminating useless symbols 
(see Definition 3.5.5 on page 125), if any. 

We leave it to the reader to show that the finite automaton (Q, Σ, q_0, F, δ) accepts
the language which is generated by the grammar (Σ, Q, P, q_0).
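
Algorithm 2.2.3 can be sketched in the same style, assuming single-character state names and input symbols, a transition function given as a dictionary from pairs (state, symbol) to sets of states, and productions returned as pairs (A, right hand side); the function name is an assumption.

```python
def nfa_to_grammar(delta, initial, finals):
    """Return the set P of productions of the S-extended type 3 grammar."""
    productions = set()
    for (a, symbol), targets in delta.items():
        for b in targets:
            productions.add((a, symbol + b))      # A -> aB
            if b in finals:
                productions.add((a, symbol))      # A -> a, because B is final
    if initial in finals:
        productions.add((initial, ""))            # q_0 -> eps
    return productions


# The automaton of Example 2.2.5, viewed as a nondeterministic automaton.
delta = {("S", "0"): {"S"}, ("S", "1"): {"A"}, ("A", "0"): {"S"}}
P = nfa_to_grammar(delta, "S", {"S", "A"})
```

For the automaton of Example 2.2.5 the result is the set of productions S → ε | 0S | 0 | 1A | 1 and A → 0S | 0.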




EXAMPLE 2.2.4. Let us consider the S-extended type 3 grammar:

({0, 1}, {S, B}, {S → 0B, B → 0B | 1S | 0}, S).

We get the nondeterministic finite automaton ({S, B, Z}, {0, 1}, S, {Z}, δ), depicted in
Figure 2.2.1. The transition function δ is defined as follows:

δ(S, 0) = {B},  δ(B, 0) = {B, Z},  and  δ(B, 1) = {S}.


FIGURE 2.2.1. The nondeterministic finite automaton of Example 2.2.4. 


EXAMPLE 2.2.5. Let us consider the deterministic finite automaton ({S, A},
{0, 1}, S, {S, A}, δ), where δ(S, 0) = S, δ(S, 1) = A, and δ(A, 0) = S (see
Figure 2.1.2 (α) on page 32). This automaton can also be viewed as a nondeterministic
finite automaton whose partial transition function is: δ(S, 0) = {S}, δ(S, 1) = {A},
and δ(A, 0) = {S}. The language accepted by this automaton is:

{w | w ∈ {0, 1}* and w does not contain two consecutive 1’s}.

We get the following S-extended type 3 grammar:

({0, 1}, {S, A}, {S → ε | 0S | 0 | 1A | 1,  A → 0S | 0}, S).


2.3. Finite Automata and Transition Graphs 


In this section we will introduce the notion of a transition graph and we will prove 
the Rabin-Scott Theorem (see Theorem 2.1.14 on page 33). 


DEFINITION 2.3.1. [Transition Graph] A transition graph (Q, £, qo, F, ô) over 
the alphabet X is a multigraph like that of a nondeterministic finite automaton over 
È, except that the transition function 6 is a total function from Q x (SU {e}) to 28 
such that for any q € Q, q € ô(q, €). 


Similarly to a deterministic or a nondeterministic finite automaton, a transition
graph can be depicted as a labeled multigraph. The edges of that multigraph are
labeled by elements in Σ ∪ {ε}.

Note that in the above Definition 2.3.1 we do not assume that for any q ∈ Q, if
q_1 ∈ δ(q, ε) and q_2 ∈ δ(q_1, ε) then q_2 ∈ δ(q, ε).

We have that every nondeterministic finite automaton can be viewed as a particular
transition graph such that for any q ∈ Q, δ(q, ε) = {q}.

Every deterministic finite automaton can be viewed as a particular transition
graph such that: (i) for every q ∈ Q, δ(q, ε) = {q}, and (ii) for every q ∈ Q and
v ∈ Σ, δ(q, v) is a singleton.

DEFINITION 2.3.2. For every transition graph with transition function δ and for
every α ∈ Σ ∪ {ε}, we define a binary relation =α⇒ which is a subset of Q × Q, as
follows:
for every q_a, q_b ∈ Q, we stipulate that q_a =α⇒ q_b iff there exists a sequence of states
(q_1, q_2, ..., q_i, q_{i+1}, ..., q_n), with 1 ≤ i < n, such that:

(i) q_1 = q_a
(ii) for j = 1, ..., i−1, q_{j+1} ∈ δ(q_j, ε)
(iii) q_{i+1} ∈ δ(q_i, α)
(iv) for j = i+1, ..., n−1, q_{j+1} ∈ δ(q_j, ε)
(v) q_n = q_b.


Since for any q ∈ Q, q ∈ δ(q, ε), we have that for every state q ∈ Q, q =ε⇒ q.

For every transition graph with transition function δ we define a total function
δ* from 2^Q × Σ* to 2^Q as follows:
(i) for every set A ⊆ Q, δ*(A, ε) = {q | there exists p ∈ A and p =ε⇒ q}, and
(ii) for every set A ⊆ Q, for every word wv with w ∈ Σ* and v ∈ Σ,

δ*(A, wv) = {q | there exists p ∈ δ*(A, w) and p =v⇒ q}.

Given a transition graph, we say that there is a w-path from state q_1 to state q_2 for
some word w ∈ Σ* iff q_2 ∈ δ*({q_1}, w). Thus, given a subset A of Q and a word w
in Σ*, δ*(A, w) is the set of all states q such that there exists a w-path from a state
in A to q.


DEFINITION 2.3.3. [Language Accepted by a Transition Graph] We say
that a transition graph (Q, Σ, q_0, F, δ) accepts a word w in Σ* iff there exists a state
in δ*({q_0}, w) which belongs to F. A transition graph accepts a language L iff
it accepts every word in L and no other word. If a transition graph T accepts a
language L, we say that L is the language accepted by T.
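
The function δ* and the acceptance condition can be sketched in Python as follows. This is only an illustrative sketch, not part of the formal development: the representation of δ as a dictionary, the use of the empty string as the label of ε-arcs, and the names eps_closure, image, delta_star, and accepts are all assumptions made here. The self-loops q ∈ δ(q, ε) of Definition 2.3.1 are left implicit, because the closure always contains the states it starts from.

```python
EPS = ""  # label chosen here for ε-arcs (an assumption of this sketch)

def eps_closure(delta, states):
    """States reachable from `states` by following zero or more ε-arcs."""
    seen = set(states)
    stack = list(seen)
    while stack:
        p = stack.pop()
        for q in delta.get((p, EPS), ()):
            if q not in seen:
                seen.add(q)
                stack.append(q)
    return seen

def image(delta, states, a):
    """The a-image: ε-arcs, then exactly one a-arc, then ε-arcs again."""
    before = eps_closure(delta, states)
    after = {q for p in before for q in delta.get((p, a), ())}
    return eps_closure(delta, after)

def delta_star(delta, states, word):
    """δ*(A, w), computed symbol by symbol from left to right."""
    current = eps_closure(delta, states)
    for a in word:
        current = image(delta, current, a)
    return current

def accepts(delta, q0, finals, word):
    """A word is accepted iff δ*({q0}, w) contains a final state."""
    return bool(delta_star(delta, {q0}, word) & set(finals))

# A small hypothetical transition graph: 1 -a-> 2, 2 -ε-> 3, 3 -b-> 3.
delta = {("1", "a"): {"2"}, ("2", EPS): {"3"}, ("3", "b"): {"3"}}
```

With initial state 1 and final state 3, this graph accepts the words a, ab, abb, and so on, and rejects the empty word.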


When introducing the concepts of this definition, other textbooks use the terms
‘recognizes’ and ‘recognized’, instead of the terms ‘accepts’ and ‘accepted’, respectively.

We will prove that the set of languages accepted by the transition graphs over
Σ is equal to REG, that is, the class of all regular languages which are subsets of Σ*.


DEFINITION 2.3.4. [Equivalence Between Transition Graphs] Two transi- 
tion graphs are said to be equivalent iff they accept the same language. 


REMARK 2.3.5. As for deterministic finite automata and nondeterministic finite
automata, one may assume that the transition functions of the transition graphs
are partial functions, rather than total functions. Indeed, by using the sink state
technique one can show that for every transition graph with a partial transition
function, there exists a transition graph with a total transition function which accepts
the same language, and vice versa. □


We have the following Theorem 2.3.7 which is a generalization of the Rabin-Scott
Theorem (see Theorem 2.1.14). We need first the following definition.

DEFINITION 2.3.6. [Image of a Set of States with respect to a Symbol]
For every subset S of Q and every a ∈ Σ, the a-image of S is the following subset
of Q:


{q_2 | there exists a state q_1 ∈ S and q_1 =a⇒ q_2}.


THEOREM 2.3.7. [Rabin-Scott. Equivalence of Finite Automata and
Transition Graphs] For every transition graph T over the alphabet Σ, there exists
a deterministic finite automaton D which accepts the same language.


PROOF. The proof is based on the following procedure, called the Powerset
Construction.


ALGORITHM 2.3.8. 
The Powerset Construction. Version 1: from Transition Graphs to Finite Automata. 


Given a transition graph T = (Q, Σ, q_0, F, δ), we construct a deterministic finite
automaton D which accepts the same language, subset of Σ*, as follows.


The set of states of the finite automaton D is 2^Q, that is, the powerset of Q.


The initial state I of D is equal to δ*({q_0}, ε) ⊆ Q. (That is, the initial state of D is
the smallest subset of Q which consists of every state q for which there is an ε-path
from q_0 to q. In particular, q_0 ∈ I.)


A state of D is final iff it includes a state from which there is an ε-path to a state
in F. (In particular, a state of D is final if it includes a state in F.)


The transition function η of the finite automaton D is defined as follows: for every
pair S_1 and S_2 of subsets of Q and for every a ∈ Σ, η(S_1, a) = S_2 iff S_2 is the a-image
of S_1.


We leave it to the reader to show that the language accepted by the transition 
graph T is equal to the language accepted by the finite automaton D constructed 
according to the Powerset Construction Procedure. (That proof can be done by 
induction on the length of the words accepted by T and D.) 


The finite automaton D which is constructed by the Powerset Construction
Procedure starting from a given transition graph T, can be kept to its smallest size if
we take the set of states of D to be the set of states reachable from I, that is,


{q | there exists a w-path from the initial state I to q, for some w ∈ Σ*}.
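
The Powerset Construction restricted to the states reachable from I can be sketched in Python as follows. This is a minimal sketch under stated assumptions: δ is a dictionary from (state, label) pairs to sets of states, ε-arcs carry the empty string as label, and the names eps_closure, powerset_construction, and dfa_accepts are chosen here for illustration. Every state of D built this way is closed under ε-arcs, so checking whether it includes a state of F agrees with the definition of the final states given above.

```python
EPS = ""  # label chosen here for ε-arcs (an assumption of this sketch)

def eps_closure(delta, states):
    """States reachable from `states` by following zero or more ε-arcs."""
    seen = set(states)
    stack = list(seen)
    while stack:
        p = stack.pop()
        for q in delta.get((p, EPS), ()):
            if q not in seen:
                seen.add(q)
                stack.append(q)
    return seen

def powerset_construction(delta, alphabet, q0, finals):
    """Return (states, initial state, final states, eta) of the automaton D."""
    start = frozenset(eps_closure(delta, {q0}))
    eta, seen, todo = {}, {start}, [start]
    while todo:
        S = todo.pop()
        for a in alphabet:
            moved = {q for p in S for q in delta.get((p, a), ())}
            T = frozenset(eps_closure(delta, moved))   # the a-image of S
            eta[(S, a)] = T
            if T not in seen:
                seen.add(T)
                todo.append(T)
    d_finals = {S for S in seen if S & set(finals)}
    return seen, start, d_finals, eta

def dfa_accepts(eta, start, d_finals, word):
    """Run D on a word whose symbols all belong to the alphabet."""
    S = start
    for a in word:
        S = eta[(S, a)]
    return S in d_finals

# A small hypothetical transition graph: 1 -a-> 2, 2 -ε-> 3, 3 -b-> 3.
delta = {("1", "a"): {"2"}, ("2", EPS): {"3"}, ("3", "b"): {"3"}}
states, start, d_finals, eta = powerset_construction(delta, "ab", "1", {"3"})
```

For this graph the reachable states of D are {1}, {2,3}, {3}, and the empty set, which plays the role of a sink state.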


EXAMPLE 2.3.9. Let us consider the following transition graph (whose transition
function is a partial function):

By applying the Powerset Construction we get a finite automaton (see Figure 2.3.1)
whose transition function δ is given by the following table, where we have underlined
the final states:


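[Table: one row per state of the finite automaton and one column for each of the
inputs 0 and 1; the entries are not reproduced here.]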

Note that in this table a state {q_1, ..., q_k} in 2^Q has been named q_1...q_k.


NOTATION 2.3.10. In the sequel, we will use the convention we have used in the 
above table, and we will underline the names of the states when we want to stress 
the fact that they are final states. 


For instance, the entry 123 for state 1 and input 1 is explained as follows: (i) from
state 1 via the arc labeled 1 followed by the arc labeled ε we get to state 1, (ii) from
state 1 via the arc labeled 1 we get to state 2, and (iii) from state 1 via the arc
labeled 1 we get to state 3. Thus, from state 1 for the input 1 we get to a state
which we call 123, and since this state is a final state (because state 2 is a final
state in the given transition graph) we have underlined its name, that is, we write
123 underlined, instead of plain 123.

Similarly, the entry 12 for state 123 and input 0 is explained as follows: (i) from
state 1 via the arc labeled 0 we get to state 2, (ii) from state 3 via the arc labeled 0
we get to state 1, and (iii) from state 3 via the arc labeled ε followed by the arc
labeled 0 we get to state 2. Thus, from state 123 for the input 0 we get to a state
which we call 12, and since this state is a final state (because state 2 is final in the
given transition graph) we have underlined its name, that is, we write 12 underlined,
instead of plain 12.

An entry ‘—’ in row r and column c of the above table means that from state r
for the input c it is not possible to get to any state, that is, the transition function
is not defined for state r and input symbol c. □

FIGURE 2.3.1. The finite automaton corresponding to the transition
graph of Example 2.3.9.


Since nondeterministic finite automata are particular transition graphs, the Powerset
Construction is a procedure which for any given nondeterministic finite automaton
(Q, Σ, q_0, F, δ) constructs a deterministic finite automaton D which accepts the same
language. This fulfills the promise of providing a proof of Theorem 2.1.14 on page 33.
Moreover, since in a nondeterministic finite automaton there are no edges labeled
by the empty string ε, the Powerset Construction can be simplified as follows when
we are given a nondeterministic finite automaton, rather than a transition graph.


ALGORITHM 2.3.11. 
Powerset Construction. Version 2: from Nondeterministic Finite Automata to
Finite Automata.


Given a nondeterministic finite automaton N = (Q, Σ, q_0, F, δ), we construct a
deterministic finite automaton D which accepts the same language, subset of Σ*, as
follows.

The set of states of D is 2^Q, that is, the powerset of Q.

The initial state of D is {q_0}.

A state of D is final iff it includes a state in F.

The transition function η of the finite automaton D is defined as follows: for every
S ⊆ Q, for every a ∈ Σ, η(S, a) = {p | p ∈ δ(q, a) and q ∈ S}.


We can keep the set of states of the automaton D as small as possible, by considering
only those states which are reachable from the initial state {q_0}.
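
As an illustration, Version 2 of the construction can be sketched in Python and applied to the nondeterministic finite automaton of Example 2.2.4. The dictionary representation of δ and the function name nfa_to_dfa are assumptions of this sketch; only the states reachable from {q_0} are generated.

```python
def nfa_to_dfa(delta, alphabet, q0, finals):
    """Powerset Construction for an automaton without ε-arcs.

    `delta` maps (state, symbol) to a set of states and may be partial.
    Returns (states, initial state, final states, eta) of the automaton D.
    """
    start = frozenset({q0})
    eta, seen, todo = {}, {start}, [start]
    while todo:
        S = todo.pop()
        for a in alphabet:
            # eta(S, a) = {p | p in delta(q, a) and q in S}
            T = frozenset(p for q in S for p in delta.get((q, a), ()))
            eta[(S, a)] = T
            if T not in seen:
                seen.add(T)
                todo.append(T)
    d_finals = {S for S in seen if S & set(finals)}
    return seen, start, d_finals, eta

# The automaton of Example 2.2.4: initial state S, final state Z.
delta = {("S", "0"): {"B"}, ("B", "0"): {"B", "Z"}, ("B", "1"): {"S"}}
states, start, d_finals, eta = nfa_to_dfa(delta, "01", "S", {"Z"})
```

Of the eight subsets of {S, B, Z}, only four are reachable here: {S}, {B}, {B,Z}, and the empty set. The only final state is {B,Z}.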


2.4. Left Linear and Right Linear Regular Grammars 


In this section we show that regular languages, which can be generated by right linear 
grammars, can also be generated by left linear grammars as we have anticipated on 
page 14. 

Let us begin by introducing the notions of right linear and left linear grammars 
in a setting where we also allow epsilon productions. 


DEFINITION 2.4.1. [Extended Right Linear Grammars and Extended
Left Linear Grammars] Given a grammar G = (V_T, V_N, P, S), (i) we say that
G is an extended right linear grammar iff every production is of the form A → β
with A ∈ V_N and β ∈ {ε} ∪ V_T ∪ V_T V_N, and (ii) we say that G is an extended
left linear grammar iff every production is of the form A → β with A ∈ V_N and
β ∈ {ε} ∪ V_T ∪ V_N V_T.


DEFINITION 2.4.2. [S-extended Right Linear Grammars and S-extended
Left Linear Grammars] Given a grammar G = (V_T, V_N, P, S), (i) we say that
G is an S-extended right linear grammar iff every production is either of the form
A → β with A ∈ V_N and β ∈ V_T ∪ V_T V_N, or it is S → ε, and (ii) we say that G is
an S-extended left linear grammar iff every production is either of the form A → β
with A ∈ V_N and β ∈ V_T ∪ V_N V_T, or it is S → ε.


We have the following theorem. 


THEOREM 2.4.3. [Equivalence of Left Linear Extended Grammars and
Right Linear Extended Grammars] (i) For every extended right linear grammar
there exists an equivalent, extended left linear grammar. (ii) For every extended left
linear grammar there exists an equivalent, extended right linear grammar.


In order to show this Theorem 2.4.3 it is enough to show the following Theorem 2.4.4 
because of the result stated in Theorem 1.5.4 Point (iv) on page 20. 


THEOREM 2.4.4. [Equivalence of Left Linear S-extended Grammars and
Right Linear S-extended Grammars] (i) For every S-extended right linear
grammar there exists an equivalent, S-extended left linear grammar. (ii) For every
S-extended left linear grammar there exists an equivalent, S-extended right linear
grammar.

PROOF. (i) Given any S-extended right linear grammar G = (V_T, V_N, P, S), we
construct the nondeterministic finite automaton M over the alphabet V_T by applying
Algorithm 2.2.2 on page 34. Thus, the language accepted by M is L(G). Then from
this automaton M viewed as a labeled multigraph, we generate an S-extended left
linear grammar G′ by applying the following procedure for generating the set P′ of
productions of G′ whose alphabet is V_T and whose start symbol is S. The set V_N′ of
nonterminal symbols of G′ consists of the nonterminal symbols occurring in P′.


ALGORITHM 2.4.5. 


Procedure: from Nondeterministic Finite Automata 
to S-extended Left Linear Grammars. (Version 1) 


Let us consider the labeled multigraph corresponding to a given nondeterministic
finite automaton M. Let S_1, ..., S_n be the final states of M. Let the set P′ of
productions be initially {S → S_1, ..., S → S_n}.

Step (1). For every edge from a state A to a state B with label a ∈ V_T, do the
following actions 1.1 and 1.2:

1.1. Add to P′ the production B → Aa.

1.2. If A is the initial state, then add to P′ also the production B → a.

Step (2). If a final state of M is also the initial state of M, then add to P′ the
production S → ε.

Step (3). Finally, for each i, with 1 ≤ i ≤ n, unfold S_i in the production S → S_i (see
Definition 1.6.4 on page 26), that is, replace S → S_i by S → σ_1 | ... | σ_m, where
S_i → σ_1 | ... | σ_m are all the productions for S_i.


In Step (1) of this procedure we have to look at every state, one at a time, and for 
each state at every incoming edge. The S-extended left linear grammar which is 
generated by this procedure can then be simplified by eliminating useless symbols 
(see Definition 3.5.5 on page 125), if any. 

If in the automaton M there exists one final state only, that is, n = 1, then 
Algorithm 2.4.5 can be simplified by: (i) calling S the final state of M, (ii) assuming 
that the set P’ is initially empty, and (iii) skipping Step (3). 

We leave it to the reader to show that the derived S-extended left linear
grammar G′ generates the same language accepted by the given finite automaton M,
which is also the language L(G) generated by the given S-extended right linear
grammar G.
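
The three steps of Algorithm 2.4.5 can be sketched in Python as follows. The representation is an assumption of this sketch: an automaton is given by a list of labeled edges, a production is a pair made of a nonterminal and a tuple of symbols (the empty tuple stands for ε), and the start symbol S is assumed not to be the name of any state. The sample automaton is that of Example 2.2.4 with its initial state renamed I, so that it does not clash with the start symbol.

```python
def nfa_to_left_linear(edges, initial, finals, start="S"):
    """Return a dict mapping each nonterminal to its set of right-hand sides.

    `edges` is a list of triples (A, a, B): an edge from state A to state B
    labeled by the symbol a.
    """
    prods = {}

    def add(lhs, rhs):
        prods.setdefault(lhs, set()).add(rhs)

    # Step (1): one or two productions for every edge.
    for A, a, B in edges:
        add(B, (A, a))            # 1.1: B -> A a
        if A == initial:
            add(B, (a,))          # 1.2: B -> a
    # Step (2): the empty word is accepted iff the initial state is final.
    if initial in finals:
        add(start, ())
    # Step (3): S -> S_i is unfolded into the productions of S_i.
    for final_state in finals:
        for rhs in list(prods.get(final_state, ())):
            add(start, rhs)
    return prods

edges = [("I", "0", "B"), ("B", "0", "B"), ("B", "0", "Z"), ("B", "1", "I")]
prods = nfa_to_left_linear(edges, "I", {"Z"})
```

For this automaton we get S → B0, B → I0 | 0 | B0, I → B1, and Z → B0. The nonterminal Z cannot be reached from S, so its production is removed when useless symbols are eliminated.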


Now we present an alternative algorithm for constructing an S-extended left 
linear grammar G from a given nondeterministic finite automaton N. 
Let L be the language accepted by the finite automaton N. 


ALGORITHM 2.4.6. 
Procedure: from Nondeterministic Finite Automata 
to S-extended Left Linear Grammars. (Version 2) 


Step (1). Construct a transition graph T starting from the nondeterministic finite
automaton N by adding ε-arcs from the final states of N to a new final state, say q_f.
Then make all states different from q_f to be non-final states.


Step (2). Reverse all arrows and interchange the final state with the initial state of T.
We have that the resulting transition graph T^R, whose initial state is q_f, accepts the
language L^R = {a_k ... a_2 a_1 | a_1 a_2 ... a_k ∈ L} ⊆ V_T*.

Step (3). Apply Algorithm 2.2.3 on page 34 to the derived transition graph T^R.
Actually, we apply an extension of that algorithm because in order to cope with the
arcs labeled by ε which may occur in T^R, we also apply the following rule:


if in T^R there is an arc labeled by ε from state A to state B
then (i) we add the production A → B, and
(ii) if B is a final state of T^R then we add also the production A → ε.


By doing so, from T^R we get an S-extended right linear grammar with the possible
exception of some productions of the form A → B.

Note that: (i) q_f is the axiom of that S-extended right linear grammar, and (ii) if
a production of the form A → ε occurs in that grammar then A is q_f.


Step (4). In the derived grammar reverse each production, that is, transform each
production of the form A → aB into a production of the form A → Ba.

Step (5). Unfold B in every production A → B (see Definition 1.6.4 on page 26), that
is, if we have the production A → B and B → β_1 | ... | β_n are all the productions
for B then we replace A → B by A → β_1 | ... | β_n.


The left linear grammar which is generated by this procedure can then be simplified 
by eliminating useless symbols (see Definition 3.5.5 on page 125), if any. 

If in the automaton N there exists one final state only, then Algorithm 2.4.6 can
be simplified by: (i) skipping Step (1) and calling q_f the unique final state of N,
(ii) adding the production q_f → ε if q_f is both the initial and the final state of T^R,
and (iii) skipping Step (5).

We leave it to the reader to show that the language generated by the derived
S-extended left linear grammar with axiom q_f, is the language L accepted by the
finite automaton N.

In Section 7.7 on page 230 we will present a different algorithm which given any
nondeterministic finite automaton, derives an equivalent left linear or right linear
grammar. That algorithm uses techniques (such as the elimination of ε-productions
and the elimination of unit productions) for the simplifications of context-free
grammars which we will present in Section 3.5.3 on page 125 and Section 3.5.4 on page 126.


(ii) Given any S-extended left linear grammar G = (V_T, V_N, P, S), we construct
a nondeterministic finite automaton M over V_T by applying the following procedure
which constructs its transition function δ, its set of states, its set of final states, and
its initial state.


ALGORITHM 2.4.7. 


Procedure: from S-extended Left Linear Grammars 
to Nondeterministic Finite Automata. 


Let us consider an S-extended left linear grammar G = (V_T, V_N, P, S).

1. The unique final state of the nondeterministic finite automaton M is the state S.


2. The initial state of the nondeterministic finite automaton M is S if the production
S → ε occurs in P, otherwise it is a new state q_0.


3. For each production of the form A → a we consider the edge labeled by a, from
node q_0 to node A.


4. For each production of the form A → Ba we consider the edge labeled by a, from
node B to node A.


The resulting labeled multigraph represents the desired nondeterministic finite
automaton. This nondeterministic finite automaton may have equivalent states which
can be eliminated (see Section 2.8).
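
The four points of Algorithm 2.4.7 can be sketched in Python as follows. The encoding is an assumption of this sketch: a production is a pair made of a nonterminal and a tuple of symbols, with the empty tuple standing for ε. In Point 3 the node q_0 is read here as the initial state of M, whichever state that is. The sample grammar is the one with productions S → Aa | a, A → Aa | Bb | a, and B → Ba | Ab | b, which has no production S → ε.

```python
def left_linear_to_nfa(productions, start="S", new_initial="q0"):
    """Return (initial state, final state, set of edges) of the automaton M.

    An edge is a triple (source, symbol, target).
    """
    # Points 1 and 2: S is the final state; it is also the initial state
    # iff the production S -> ε occurs in the grammar.
    has_epsilon = (start, ()) in productions
    initial = start if has_epsilon else new_initial
    edges = set()
    for lhs, rhs in productions:
        if len(rhs) == 1:                 # Point 3: A -> a
            edges.add((initial, rhs[0], lhs))
        elif len(rhs) == 2:               # Point 4: A -> B a
            B, a = rhs
            edges.add((B, a, lhs))
    return initial, start, edges

productions = [
    ("S", ("A", "a")), ("S", ("a",)),
    ("A", ("A", "a")), ("A", ("B", "b")), ("A", ("a",)),
    ("B", ("B", "a")), ("B", ("A", "b")), ("B", ("b",)),
]
initial, final, edges = left_linear_to_nfa(productions)

def outgoing(state):
    """The labeled edges leaving a state, as (symbol, target) pairs."""
    return {(a, q) for (p, a, q) in edges if p == state}
```

In the resulting automaton the states q0 and A have exactly the same outgoing edges and neither of them is final, so they are equivalent states in the sense mentioned above.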


Then from this nondeterministic finite automaton M, we construct an equivalent
S-extended right linear grammar G′ by applying Algorithm 2.2.3 on page 34.

We leave it to the reader to show that: (i) the language generated by the given
S-extended left linear grammar G is equal to the language accepted by the constructed
finite automaton M, and (ii) the language accepted by M is equal to the language
generated by the S-extended right linear grammar G′.


EXAMPLE 2.4.8. Let us consider the nondeterministic finite automaton depicted
in Figure 2.4.1. The left linear grammar with axiom S which generates the language
accepted by that automaton, has the following productions:


S → Aa | a
A → Aa | Bb | a
B → Ba | Ab | b


FIGURE 2.4.1. A nondeterministic finite automaton which accepts 
the language generated by the left linear grammar of Example 2.4.8. 


EXAMPLE 2.4.9. Let us consider the left linear grammar with the following
productions (see Example 2.4.8):

S → Aa | a
A → Aa | Bb | a
B → Ba | Ab | b

If we apply Algorithm 2.4.7 we get the nondeterministic finite automaton of
Figure 2.4.2. In Section 2.3 we have presented the Powerset Construction Procedure for
generating a deterministic finite automaton which accepts the same language as a
given nondeterministic finite automaton, and in Section 2.8 we will present a
procedure for determining whether or not two deterministic finite automata are equivalent.
We leave it as an exercise to the reader to prove that, by applying those procedures,
the nondeterministic finite automaton of Figure 2.4.1 accepts the same language
which is accepted by the nondeterministic finite automaton of Figure 2.4.2.


FIGURE 2.4.2. The nondeterministic finite automaton obtained from
the left linear grammar of Example 2.4.9 by applying Algorithm 2.4.7.
States q_0 and A are equivalent.


REMARK 2.4.10. The following two observations may help the reader to realize
the correctness of Algorithm 2.2.2 (on page 34) and Algorithm 2.2.3 (on page 34)
presented in the proof of Theorem 2.2.1 (on page 33), and Algorithms 2.4.5, 2.4.6, and
2.4.7 (on pages 40, 41, and 42, respectively) presented in the proof of Theorem 2.4.4
(on page 40):

(i) in the right linear grammars every nonterminal symbol A corresponds to a
state q_A which represents the set S_A of words such that for every word w ∈ S_A
there exists a w-path from q_A to a final state, that is, the language generated by the
nonterminal symbol A (see Definition 1.2.4 on page 11), and


(ii) in the left linear grammars every nonterminal symbol A corresponds to a state q_A
which represents the set S_A of words such that for every word w ∈ S_A there exists
a w-path from the initial state to q_A.

Thus, we can say that:
(i) in the right linear grammars every state encodes its future until a final state, and
(ii) in the left linear grammars every state encodes its past from the initial state. □


EXERCISE 2.4.11. (i) Construct the right linear grammar equivalent to the left
linear grammar G_L, whose axiom is S and whose productions are:


S → Ab      B → Ba
A → Ba      B → a
A → a


(ii) Construct the left linear grammar equivalent to the right linear grammar G_R,
whose axiom is S and whose productions are:


S → aA      B → aA
S → aB      B → aB
A → b                                                        □


2.5. Finite Automata and Regular Expressions 


In this section we prove a theorem due to Kleene which establishes the correspon- 
dence between finite automata and regular expressions. In order to state the Kleene 
Theorem we need the following definitions. 


DEFINITION 2.5.1. [Regular Expression] A regular expression over an alphabet
Σ is an expression e of the form:


e ::= ∅ | a | e_1 · e_2 | e_1 + e_2 | e*

for any a ∈ Σ.

Sometimes the concatenation e_1 · e_2 is simply written as e_1 e_2. The regular expression
∅* will also be denoted by ε.

The reader should notice the overloading of the symbols in Σ. Indeed, each
symbol of Σ may also be a regular expression. Analogously, ε denotes the empty
word and also the regular expression ∅*.

The set of regular expressions over Σ is denoted by RExpr_Σ, or simply RExpr,
when Σ is understood from the context.


DEFINITION 2.5.2. [Language Denoted by a Regular Expression] A regular
expression e over the alphabet Σ denotes a language L(e) ⊆ Σ* which is defined by
the following rules:


(i) L(∅) = ∅,
(ii) for any a ∈ Σ, L(a) = {a},
(iii) L(e_1 · e_2) = L(e_1) · L(e_2), where on the left hand side ‘·’ denotes concatenation
of regular expressions, and on the right hand side ‘·’ denotes concatenation of
languages as defined in Section 1.1,
(iv) L(e_1 + e_2) = L(e_1) ∪ L(e_2), and
(v) L(e*) = (L(e))*, where on the right hand side ‘*’ denotes the operation on
languages which is defined in Section 1.1.


The set of languages denoted by the regular expressions over Σ is called L_{RExpr,Σ}
or simply L_RExpr, when Σ is understood from the context. We will prove that
L_{RExpr,Σ} is equal to REG, that is, the class of all regular languages which are subsets of Σ*.


We have that L(ε) = {ε}. Since {ε} is the neutral element of language
concatenation, we also have that L(ε · e) = L(e · ε) = L(e).


DEFINITION 2.5.3. [Equivalence Between Regular Expressions] Two
regular expressions e_1 and e_2 are said to be equivalent, and we write e_1 = e_2, iff they
denote the same language, that is, L(e_1) = L(e_2).


In Section 2.7 we will present an axiomatization of all the equivalences between 
regular expressions. 

In the following definition we generalize the notion of transition graph given in
Definition 2.3.1 by allowing the labels of the edges to be regular expressions, rather
than elements of Σ ∪ {ε}.


DEFINITION 2.5.4. [RExpr Transition Graph] An RExpr_Σ transition graph
(Q, Σ, q_0, F, δ) over the alphabet Σ is a multigraph like that of a nondeterministic
finite automaton over Σ, except that the transition function δ is a total function from
Q × RExpr_Σ to 2^Q such that for any q ∈ Q, q ∈ δ(q, ε). When Σ is understood from
the context, we will write ‘RExpr transition graph’, instead of ‘RExpr_Σ transition
graph’.


Similarly to a transition graph, an RExpr transition graph over Σ can be depicted
as a labeled multigraph. The edges of that multigraph are labeled by regular
expressions over Σ.

DEFINITION 2.5.5. [Image of a Set of States with respect to a Regular
Expression] For every subset S of Q and every e ∈ RExpr_Σ, the e-image of S is
the smallest subset of Q which includes every state q_{n+1} such that:

(i) there exists a word w ∈ L(e) which is the concatenation of the n (≥ 1) words
w_1, ..., w_n of Σ*,

(ii) there exists a sequence of edges (q_1, q_2), (q_2, q_3), ..., (q_n, q_{n+1}) such that q_1 ∈ S,
and

(iii) for i = 1, ..., n, the word w_i belongs to the language denoted by the regular
expression which is the label of (q_i, q_{i+1}).


Based on this definition, for every RExpr transition graph with transition function
δ we define a total function δ* from 2^Q × RExpr to 2^Q as follows:


for every set A ⊆ Q and e ∈ RExpr, δ*(A, e) is the e-image of A.


Given an RExpr transition graph, we say that there is a w-path from state p to state
q for some word w ∈ Σ* iff q ∈ δ*({p}, w). Thus, given a subset A of Q and a regular
expression e over Σ, δ*(A, e) is the set of all states q such that there exists a w-path
with w ∈ L(e), from a state in A to q.


DEFINITION 2.5.6. [Language Accepted by an RExpr Transition Graph]
An RExpr transition graph (Q, Σ, q_0, F, δ) accepts a word w in Σ* iff there exists
a state in δ*({q_0}, w) which belongs to F. An RExpr transition graph accepts a
language L iff it accepts every word in L and no other word. If an RExpr transition
graph T accepts a language L, we say that L is the language accepted by T.


When introducing the concepts of this definition, other textbooks use the terms
‘recognizes’ and ‘recognized’, instead of the terms ‘accepts’ and ‘accepted’, respectively.

We will prove that the set of languages accepted by the RExpr transition graphs
over Σ is equal to REG, that is, the class of all regular languages which are subsets of Σ*.


DEFINITION 2.5.7. [Equivalence Between RExpr Transition Graphs] Two 
RExpr transition graphs are said to be equivalent iff they accept the same language. 


REMARK 2.5.8. As for transition graphs, one may assume that the transition
functions of the RExpr transition graphs are partial functions, rather than total
functions. Indeed, by using the sink state technique one can show that for every
RExpr transition graph with a partial transition function, there exists an RExpr
transition graph with a total transition function which accepts the same language,
and vice versa. □


DEFINITION 2.5.9. [Equivalence Between Regular Expressions, Finite
Automata, Transition Graphs, and RExpr Transition Graphs] (i) A regular
expression and a finite automaton (or a transition graph, or an RExpr transition
graph) are said to be equivalent iff the language denoted by the regular expression
is the language accepted by the finite automaton (or the transition graph, or the
RExpr transition graph, respectively).

Analogous definitions will be assumed for the notions of: (ii) the equivalence
between finite automata and transition graphs (or RExpr transition graphs), and
(iii) the equivalence between transition graphs and RExpr transition graphs.


Now we can state and prove the following theorem due to Kleene. 


THEOREM 2.5.10. [Kleene Theorem] (i) For every deterministic finite
automaton D over the alphabet Σ there exists an equivalent regular expression over Σ, that
is, a regular expression which denotes the language accepted by D.

(ii) For every regular expression e over the alphabet Σ there exists an equivalent
deterministic finite automaton over Σ, that is, a finite automaton which accepts the
language denoted by e.


PROOF. (i) Let us consider a finite automaton (Q, Σ, q_0, F, δ). Obviously, any finite
automaton over Σ is a particular instance of an RExpr_Σ transition graph over Σ.

Then we apply the following algorithm which generates an RExpr_Σ transition
graph consisting of the two states q_in and f and one edge from q_in to f labeled by a
regular expression e. The reader may convince himself that the language accepted
by the given finite automaton is equal to the language denoted by e.


ALGORITHM 2.5.11. 
Procedure: from Finite Automata to Regular Expressions. 


Step (1). Introduction of ε-edges.

(1.1) We add a new, initial state q_in and an edge from q_in to q_0 labeled by ε. Let q_in
be the new unique, initial state.

(1.2) We add a single, new final state f and an edge from every element of F to f
labeled by ε. Let {f} be the new set of final states.


Step (2). Node Elimination.
For every node k different from q_in and f, apply the following procedure:

Let (p_1, k), ..., (p_m, k) be all the edges incoming to k and starting from nodes distinct
from k. Let the associated labels be the regular expressions x_1, ..., x_m, respectively.

Let (k, q_1), ..., (k, q_n) be all the edges outgoing from k and arriving at nodes distinct
from k. Let the associated labels be the regular expressions z_1, ..., z_n, respectively.

Let the labels associated with the s (≥ 0) edges from k to k be the regular expressions
y_1, ..., y_s, respectively.

We eliminate: (i) the node k, (ii) the m edges (p_1, k), ..., (p_m, k), (iii) the n
edges (k, q_1), ..., (k, q_n), and (iv) the s edges from k to k.

We add every edge of the form (p_i, q_j), with 1 ≤ i ≤ m and 1 ≤ j ≤ n, with
label x_i (y_1 + ... + y_s)* z_j.

We replace every set of n (≥ 2) edges, all outgoing from the same node, say h,
and all incoming to the same node, say k, whose labels are the regular expressions
e_1, e_2, ..., e_n, respectively, by a unique edge (h, k) with label e_1 + e_2 + ... + e_n.
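
The two steps of this procedure can be sketched in Python as follows. Several choices here are assumptions of the sketch and not part of the algorithm: labels are written in the pattern syntax of Python's re module (where | plays the role of + and the empty pattern plays the role of ε), a missing edge is represented by None, parallel edges are merged as soon as they are created, and the names q_in and f are assumed not to be names of states of the given automaton. The sample automaton is the one of Example 2.2.5.

```python
import re
from itertools import product

def union(x, y):
    """Label of two parallel edges merged into one; None means 'no edge'."""
    if x is None:
        return y
    if y is None:
        return x
    return f"(?:{x}|{y})"

def dfa_to_regex(states, delta, q0, finals):
    """Return a pattern for the accepted language, or None if it is empty.

    `delta` maps (state, symbol) to a state and may be partial.
    """
    q_in, f = "q_in", "f"
    edge = {}                      # (source, target) -> label of the edge

    def add(p, q, label):
        edge[(p, q)] = union(edge.get((p, q)), label)

    # Step (1): introduction of ε-edges, then the edges of the automaton.
    add(q_in, q0, "")
    for q in finals:
        add(q, f, "")
    for (p, a), q in delta.items():
        add(p, q, re.escape(a))

    # Step (2): elimination of every node k different from q_in and f.
    for k in states:
        loop = edge.pop((k, k), None)
        star = "" if loop is None else f"(?:{loop})*"   # (y_1 + ... + y_s)*
        incoming = [(p, x) for (p, q), x in edge.items() if q == k]
        outgoing = [(q, z) for (p, q), z in edge.items() if p == k]
        for p, _ in incoming:
            del edge[(p, k)]
        for q, _ in outgoing:
            del edge[(k, q)]
        for p, x in incoming:
            for q, z in outgoing:
                add(p, q, x + star + z)                 # x_i (y_1+...+y_s)* z_j
    return edge.get((q_in, f))

def dfa_accepts(delta, q0, finals, word):
    """Run the deterministic automaton; an undefined transition rejects."""
    q = q0
    for a in word:
        if (q, a) not in delta:
            return False
        q = delta[(q, a)]
    return q in finals

# The automaton of Example 2.2.5 (no two consecutive 1's).
delta = {("S", "0"): "S", ("S", "1"): "A", ("A", "0"): "S"}
pattern = dfa_to_regex(["S", "A"], delta, "S", {"S", "A"})
words = ["".join(t) for n in range(7) for t in product("01", repeat=n)]
```

The resulting pattern can be compared with the automaton on all the words in the list words by using re.fullmatch. Such a comparison is only a sanity check on short words, not a proof of equivalence.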


(ii) Given a regular expression e over Σ, we construct a finite automaton D which
accepts the language denoted by e by performing the following two steps.

Step (ii.1). From the given regular expression e, we construct a new transition
graph T by applying the following algorithm defined by structural induction on e.


ALGORITHM 2.5.12. 


Procedure: from Regular Expressions to Transition Graphs (see also Figure 2.5.1). 


From the given regular expression e, we first construct an RExpr_Σ transition graph G
with two states only: q_in and f, q_in being the only initial state and f being the only
final state. Let G have the unique edge (q_in, f) labeled by e. Then, we construct
the transition graph T by performing as long as possible the following actions.


If e = ∅ then we erase the edge.

If e = a for some a ∈ Σ, then we do nothing.

If e = e_1 · e_2 then we replace the edge, say (a,b), with associated label e_1 · e_2, by
the two edges (a,k) and (k,b) for some new node k, with associated labels e_1 and
e_2, respectively.

If e = e_1 + e_2 then we replace the edge, say (a,b), with associated label e_1 + e_2, by
the two edges (a,b) and (a,b) with associated labels e_1 and e_2, respectively.

If e = e_1* then we replace the edge, say (a,b), with associated label e_1*, by the three
edges (a,k), (k,k), and (k,b), for some new node k, with associated labels ε, e_1,
and ε, respectively.


FIGURE 2.5.1. From Regular Expressions to Transition Graphs. a is
any symbol in Σ.
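
The structural induction of Algorithm 2.5.12 can be sketched in Python as follows. The encoding of regular expressions as nested tuples, the use of the empty string as the label of ε-edges, the use of integers as names of the new nodes, and the function names are all assumptions of this sketch. The function accepts simulates the resulting transition graph directly, without building the deterministic automaton.

```python
from itertools import count

EPS = ""  # label chosen here for ε-edges (an assumption of this sketch)

# Regular expressions as tuples:
#   ("empty",)  ("sym", a)  ("cat", e1, e2)  ("alt", e1, e2)  ("star", e1)

def regex_to_graph(e):
    """Return the edges (source, label, target) of a transition graph
    with initial state "q_in" and final state "f"."""
    fresh = count()        # generator of names for the new nodes
    edges = []

    def expand(a, e, b):
        kind = e[0]
        if kind == "empty":                 # the edge is erased
            return
        if kind == "sym":                   # the edge is kept as it is
            edges.append((a, e[1], b))
        elif kind == "cat":                 # a -e1-> k -e2-> b
            k = next(fresh)
            expand(a, e[1], k)
            expand(k, e[2], b)
        elif kind == "alt":                 # two parallel edges from a to b
            expand(a, e[1], b)
            expand(a, e[2], b)
        elif kind == "star":                # a -ε-> k, k -e1-> k, k -ε-> b
            k = next(fresh)
            edges.append((a, EPS, k))
            expand(k, e[1], k)
            edges.append((k, EPS, b))

    expand("q_in", e, "f")
    return edges

def accepts(edges, word):
    """Check whether there is a path from q_in to f spelling `word`."""
    def closure(states):
        seen, stack = set(states), list(states)
        while stack:
            p = stack.pop()
            for s, label, t in edges:
                if s == p and label == EPS and t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen

    current = closure({"q_in"})
    for c in word:
        current = closure({t for s, label, t in edges if s in current and label == c})
    return "f" in current

# The regular expression ab* + b.
e = ("alt", ("cat", ("sym", "a"), ("star", ("sym", "b"))), ("sym", "b"))
graph = regex_to_graph(e)
```

For the regular expression ab* + b the graph has five edges and two new nodes. The regular expression ε, that is ∅*, is encoded as ("star", ("empty",)).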


Step (ii.2). From the transition graph T we generate the finite automaton D which 
accepts the same language accepted by T, by applying the Powerset Construction 
Procedure (see Algorithm 2.3.8 on page 37). 

The reader may convince himself that the language denoted by the given regular
expression e is equal to the language accepted by T and, by Theorem 2.3.7, also to
the language accepted by D.

In the proof of Kleene Theorem above, we have given two algorithms: (i) a first
one (Algorithm 2.5.11 on page 47) for constructing a regular expression equivalent
to a given finite automaton, and (ii) a second one (Algorithm 2.5.12 on page 48) for
constructing a finite automaton equivalent to a given regular expression.

These two algorithms are not the most efficient ones, and indeed, more efficient 
algorithms can be found in the literature (see, for instance, [5]). 


Figure 2.5.2 on page 50 illustrates the equivalence of finite automata, transition 
graphs, and regular expressions as stated by the Kleene Theorem. In that figure 
we have also indicated the algorithms which provide a constructive proof of that 
equivalence. 


EXERCISE 2.5.13. Show that, in order to allow a simpler application of the 
Powerset Construction Procedure, one can simplify the transition graph obtained 
by Algorithm 2.5.12 on page 48, by applying to that graph the following graph 
rewriting rules, each of which (i) deletes one node and (ii) replaces three edges by 
two edges: 


R1. [diagram of the rewriting rule between nodes A and B not reproduced]

R2. [diagram of the rewriting rule between nodes A and B not reproduced]


Rule R1 is applied if no other edges, besides the one labeled by ε, depart from
node A. Rule R2 is applied if no other edges, besides the one labeled by ε,
arrive at node B. Crossed dashed edges denote these conditions. The transition
graphs obtained from the regular expressions ab* + b and a* + b* show that these 
conditions are actually needed in the sense that, if we apply in those transition 
graphs rules R1 and R2, we do not preserve the languages that they accept. 


EXAMPLE 2.5.14. [From a Finite Automaton to an Equivalent Regular 
Expression] Given the finite automaton of Figure 2.3.1 on page 39, we want to 
construct the regular expression which denotes the language accepted by that finite 
automaton. We apply Algorithm 2.5.11 on page 47. 


After Step (1) of that algorithm we get the transition graph: 
  Transition Graphs   ->  Regular Expressions : Introduction of ε-edges and Node
                                                Elimination (Algorithm 2.5.11 on page 47)
  Regular Expressions ->  Transition Graphs   : Structural Induction
                                                (Algorithm 2.5.12 on page 48)
  Transition Graphs   ->  Finite Automata     : Powerset Construction
                                                (Algorithm 2.3.8 on page 37)
  Finite Automata     ->  Transition Graphs   : obvious


FIGURE 2.5.2. A pictorial view of the Kleene Theorem: equivalence
of finite automata, transition graphs, and regular expressions.


Then by eliminating node 1 in the transition graph T (see subgraph S1 below), we 
get the transition graph T1: 


[Figure not reproduced: subgraph S1 and transition graph T1.]


Then by eliminating node 2 in the transition graph T1 (see subgraph S2 below), we 
get the transition graph T2: 


[Figure not reproduced: subgraph S2 and transition graph T2.]


Then by eliminating node 12 in the transition graph T2 (see subgraph S3 below),
we get the transition graph T3:


Then by eliminating node 123 in the transition graph T3 (see subgraph S4 below),
we get the transition graph T4:


[Figure not reproduced: subgraph S4 and transition graph T4; the edge labels
that survive are 1+01 and 1(1+01)*(0+00+ε).]


Thus, the resulting regular expression is: 0 + 1(1 + 01)*(0 + 00 + ε).
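
As a quick sanity check, the resulting expression can be transcribed into the pattern syntax of Python's `re` module and tried on a few words. This is only a sketch: the transcription (the sum written as `|` and the alternative ε written as an optional group) and the name `in_language` are choices made here, not part of the text.

```python
import re

# 0 + 1(1 + 01)*(0 + 00 + ε): "+" becomes "|", the ε alternative becomes "?".
PATTERN = re.compile(r"0|1(?:1|01)*(?:0|00)?")

def in_language(word):
    """True iff the whole word belongs to the language denoted by the expression."""
    return PATTERN.fullmatch(word) is not None

for w in ["0", "1", "10", "100", "1011", "", "00", "1000"]:
    print(repr(w), in_language(w))
```

`fullmatch` is used instead of `match` because membership in the language concerns the whole word, not a prefix of it.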


EXAMPLE 2.5.15. [From a Regular Expression to an Equivalent Finite 
Automaton] Given the regular expression: 


0 + 1(1 + 01)*(0 + 00 + ε),


we want to construct the finite automaton which accepts the language denoted by
that regular expression. We can do so in the following two steps:


(i) we construct a transition graph T which accepts the same language denoted by 
the given regular expression by applying Algorithm 2.5.12 on page 48, and then 


(ii) from T by using the Powerset Construction Procedure (see Algorithm 2.3.8 on 
page 37), we get a finite automaton which is equivalent to T. 


The transition graph T equivalent to the given regular expression is depicted in 
Figure 2.5.3 on page 52. 

By applying the Powerset Construction Procedure we get a finite automaton A 
whose transition function is given in the following table where the final states have 
been underlined. 
FIGURE 2.5.3. The transition graph T corresponding to the regular
expression 0 + 1(1 + 01)*(0 + 00 + ε).


Transition function of the finite automaton A:
[table not reproduced: one row per state, one column for each of the inputs 0 and 1]


The initial state of the finite automaton A is 1 (note that in the transition graph T
there are no edges labeled by ε outgoing from state 1). All states, except state 1,
are final states (and thus, we have underlined them) because they all include state 7
which is a final state in the transition graph T. (Recall that a state s of the finite
automaton A should be final iff it includes a state from which in the transition
graph T there is an ε-path to a final state of T and, in particular, if s includes a
final state of the transition graph.)

The entries of the above table are computed as stated by the Powerset Construction
Procedure. For instance, from state 1 for input 1 we get to state 2357 because
in T:

(i) there is an edge from state 1 to state 2 labeled 1,
(ii) there is a (1ε)-path (that is, a 1-path) from state 1 to state 3,
(iii) there is a (1εε)-path (that is, a 1-path) from state 1 to state 5,
(iv) there is a (1εεε)-path (that is, a 1-path) from state 1 to state 7, and
(v) no other states in T are reachable from state 1 by a 1-path.
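
The computation of a single table entry can be sketched as follows. The edge list below is hypothetical: it contains only four edges chosen to agree with items (i)-(v) above, while the actual graph T of Figure 2.5.3 has more edges. Edges are assumed to be triples (source, label, target), with the empty string as the label of ε-edges, and `step` assumes that its argument is already closed under ε-edges.

```python
def eps_closure(states, edges):
    """All nodes reachable from `states` through ε-edges (label "")."""
    closure, stack = set(states), list(states)
    while stack:
        p = stack.pop()
        for (src, lab, dst) in edges:
            if src == p and lab == "" and dst not in closure:
                closure.add(dst)
                stack.append(dst)
    return frozenset(closure)

def step(states, symbol, edges):
    """Nodes reachable from the ε-closed set `states` by one edge labeled
    `symbol`, followed by any number of ε-edges."""
    direct = {dst for (src, lab, dst) in edges if src in states and lab == symbol}
    return eps_closure(direct, edges)

# Hypothetical fragment of a transition graph: 1 -1-> 2 -ε-> 3 -ε-> 5 -ε-> 7.
fragment = [(1, "1", 2), (2, "", 3), (3, "", 5), (5, "", 7)]
print(sorted(step(eps_closure({1}, fragment), "1", fragment)))  # nodes 2, 3, 5, 7
```

Each state of the resulting finite automaton is such a set of nodes, which the text writes by juxtaposing the node names (for instance, 2357).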


Likewise, from state 2357 for input 1 we get to state 357, because in T there is 
the following transition subgraph: 
As we will see in Section 2.8, the states 2357 and 357 are equivalent and we get a
minimal automaton M (see Figure 2.5.4) whose transition function is represented 
in the following table. 


Transition function of the minimal finite automaton M:
[table not reproduced: one row per state, one column for each of the inputs 0 and 1]


In this table, according to our conventions, we have underlined the three final 
states 7, 357, and 467. 


FIGURE 2.5.4. The minimal finite automaton M corresponding to 
the transition graph T of Figure 2.5.3. 


As expected, the finite automaton M of Figure 2.5.4 is isomorphic to the one of 
Example 2.5.14 depicted in Figure 2.3.1 on page 39. 


We have the following theorem. 


THEOREM 2.5.16. [Equivalence Between Regular Languages and Regu- 
lar Expressions] A language is a regular language iff it is denoted by a regular 
expression. 


PROOF. By Theorem 2.2.1 we have that every regular language corresponds to a
nondeterministic finite automaton, and by Theorem 2.1.14 we have that every
nondeterministic finite automaton corresponds to a deterministic finite automaton. By
Theorem 2.5.10 we also have that the set of languages accepted by deterministic
finite automata over an alphabet Σ is equal to the set of languages denoted
by regular expressions over Σ. □


As a consequence of this theorem and Kleene Theorem (see Theorem 2.5.10 on 
page 47), we have that there exists an equivalence between 

(i) regular expressions, 

(ii) finite automata, and 

(iii) S-extended regular grammars. 


THEOREM 2.5.17. [The Boolean Algebra of the Regular Languages] The
set of languages accepted by finite automata (and thus, the set of languages denoted 
by regular expressions and also the set of regular languages) is a boolean algebra. 


PROOF. We first show that the set of languages accepted by finite automata is
closed under: (i) complementation with respect to Σ*, and (ii) intersection.


(i) Let us consider a finite automaton A over the alphabet Σ which accepts the
language L_A. We want to construct a finite automaton Ā which accepts Σ* − L_A.

In order to do so, we first add to the finite automaton A a sink state s which is
not final for A. Then for each state q (including the sink state s) and label a in Σ
such that there is no outgoing edge from q with label a, we add a new edge from q
to s with label a. By doing so the transition function of the derived augmented
finite automaton is guaranteed to be a total function. Finally, we get the
automaton Ā by interchanging the final states with the non-final ones.


(ii) The finite automaton C which accepts the intersection of the language L_A
accepted by a finite automaton A (over the alphabet Σ) and the language L_B accepted
by a finite automaton B (over the alphabet Σ), is constructed as follows.

The states of C are the elements of the cartesian product of the sets of states of
A and B, and the initial state of C is the pair of the initial states of A and B.
For every a ∈ Σ we stipulate that δ((q_i, q_j), a) = (q_h, q_k) iff δ(q_i, a) = q_h
for the automaton A and δ(q_j, a) = q_k for the automaton B. The final states of the
automaton C are of the form (q_r, q_s), where q_r is a final state of A and q_s is a final
state of B.

The element 1 of the boolean algebra is the language Σ* and the element 0 is
the empty language, that is, the language with no words.

We leave it to the reader to check that the various axioms of the boolean algebra
are valid, that is, for every language x, y, and z ⊆ Σ*, the following properties hold:


1.1  x ∪ y = y ∪ x                        1.2  x ∩ y = y ∩ x
2.1  (x ∪ y) ∩ z = (x ∩ z) ∪ (y ∩ z)      2.2  (x ∩ y) ∪ z = (x ∪ z) ∩ (y ∪ z)
3.1  x ∪ x̄ = 1                            3.2  x ∩ x̄ = 0
4.1  x ∪ 0 = x                            4.2  x ∩ 1 = x
5.   0 ≠ 1


where x̄ denotes Σ* − x. All these properties are obvious because the operations on
languages are set theoretic operations. □
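
The two constructions used in this proof can be sketched in a few lines. This is a minimal illustration under assumptions made here: a finite automaton is a tuple (states, delta, initial, finals), where delta is a dictionary from (state, symbol) pairs to states, and the name "sink" is assumed not to be a state of the given automata. All function names are hypothetical.

```python
def complete(dfa, alphabet, sink="sink"):
    """Add a non-final sink state so that the transition function becomes total."""
    states, delta, q0, finals = dfa
    states, delta = set(states) | {sink}, dict(delta)
    for q in states:                      # the sink state itself is included
        for a in alphabet:
            delta.setdefault((q, a), sink)
    return states, delta, q0, set(finals)

def complement(dfa, alphabet):
    """Automaton for Σ* − L: make δ total, then swap final and non-final states."""
    states, delta, q0, finals = complete(dfa, alphabet)
    return states, delta, q0, states - finals

def intersection(dfa_a, dfa_b, alphabet):
    """Product automaton: pairs of states, componentwise transitions."""
    sa, da, qa, fa = complete(dfa_a, alphabet)
    sb, db, qb, fb = complete(dfa_b, alphabet)
    states = {(p, q) for p in sa for q in sb}
    delta = {((p, q), a): (da[p, a], db[q, a]) for (p, q) in states for a in alphabet}
    finals = {(p, q) for p in fa for q in fb}
    return states, delta, (qa, qb), finals

def accepts(dfa, word):
    """Run the automaton on the word; a missing transition means rejection."""
    _, delta, q, finals = dfa
    for a in word:
        if (q, a) not in delta:
            return False
        q = delta[q, a]
    return q in finals

# Two illustrative automata over {a, b}: an even number of a's, and words ending in b.
even_a = ({0, 1}, {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}, 0, {0})
ends_b = ({"s", "t"}, {("s", "a"): "s", ("s", "b"): "t",
                       ("t", "a"): "s", ("t", "b"): "t"}, "s", {"t"})
both = intersection(even_a, ends_b, "ab")
print(accepts(both, "aab"), accepts(both, "ab"))
```

Completing the automaton before swapping the final states matters: without the sink state, a word that falls off a partial transition function would be rejected both by the automaton and by its complement.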
In the following definition we introduce the notion of an automaton, called the
complement automaton, which for any given alphabet Σ1 and automaton A, accepts
the language (Σ1)* − L(A).

DEFINITION 2.5.18. [Complement of a Finite Automaton] Given a finite
automaton A over the alphabet Σ, the complement automaton Ā of A with respect
to any given alphabet Σ1, is a finite automaton over the alphabet Σ1 such that Ā
accepts the language L(Ā) = (Σ1)* − L(A).

Now we present a procedure for constructing, for any given finite automaton over
an alphabet Σ, the complement finite automaton with respect to an alphabet Σ1.
This procedure generalizes the one presented in the proof of Theorem 2.5.17 above.


ALGORITHM 2.5.19.
Procedure: Construction of the Complement Finite Automaton with respect to an
Alphabet Σ1.


We are given a finite automaton A over the alphabet Σ. We construct the
complement automaton Ā with respect to the alphabet Σ1 as follows.

We first add to the automaton A a sink state s which is not final for A. Then for
each state q (including the sink state s) and label a ∈ Σ1 such that there is no
outgoing edge from q with label a, we add a new edge from q to s with label a.
Then in the derived automaton we erase all edges labeled by the elements in Σ − Σ1.
Finally, we get the complement automaton Ā by interchanging the final states with
the non-final ones.
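
A minimal sketch of this procedure follows, assuming that an automaton is a tuple (states, delta, initial, finals) with delta a dictionary from (state, symbol) pairs to states, and that the name "sink" is not already a state. The names `complement_wrt` and `accepts` are hypothetical. The sketch erases the edges labeled in Σ − Σ1 before adding the edges to the sink, which yields the same automaton as the order used in the procedure.

```python
def complement_wrt(dfa, sigma1, sink="sink"):
    """Complement automaton of `dfa` with respect to the alphabet `sigma1`."""
    states, delta, q0, finals = dfa
    states = set(states) | {sink}
    # Keep only the edges labeled by symbols of Σ1 (edges in Σ − Σ1 are erased).
    new_delta = {(q, a): r for (q, a), r in delta.items() if a in sigma1}
    # Every missing transition on a symbol of Σ1 now leads to the sink state.
    for q in states:
        for a in sigma1:
            new_delta.setdefault((q, a), sink)
    # Interchange the final states with the non-final ones.
    return states, new_delta, q0, states - set(finals)

def accepts(dfa, word):
    """Run the automaton on the word; a missing transition means rejection."""
    _, delta, q, finals = dfa
    for a in word:
        if (q, a) not in delta:
            return False
        q = delta[q, a]
    return q in finals

# Illustrative automaton over Σ = {a, b} accepting exactly the words a and b.
A = ({0, 1}, {(0, "a"): 1, (0, "b"): 1}, 0, {1})
co = complement_wrt(A, {"a", "c"})        # Σ1 = {a, c}
print([w for w in ["", "a", "c", "ac", "aa", "b"] if accepts(co, w)])
```

In the example the complement accepts every word over {a, c} except the word a, while words containing b are rejected because b is not a symbol of Σ1.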


In the following definition we introduce the extended regular expressions over
an alphabet Σ. They are defined to be the regular expressions over Σ where we
also allow: (i) the complementation operation, denoted by an overbar, and (ii) the
intersection operation, denoted ∧.


DEFINITION 2.5.20. [Extended Regular Expressions] An extended regular
expression over an alphabet Σ is an expression e of the form:


e ::= ∅ | a | e1 · e2 | e1 + e2 | e* | ē | e1 ∧ e2

where a ranges over the alphabet Σ.


DEFINITION 2.5.21. [Language Denoted by an Extended Regular Expression]
The language L(e) ⊆ Σ* denoted by an extended regular expression e over the
alphabet Σ is defined by structural induction as follows (see also Definition 2.5.2):


L(∅) = ∅
L(a) = {a} for any a ∈ Σ
L(e1 · e2) = L(e1) · L(e2)
L(e1 + e2) = L(e1) ∪ L(e2)
L(e*) = (L(e))*
L(ē) = Σ* − L(e)
L(e1 ∧ e2) = L(e1) ∩ L(e2)
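
The structural induction above can be turned directly into a small evaluator, provided we restrict ourselves to the words of length at most n, so that every language involved is finite. This is a sketch under assumptions made here: expressions are nested tuples, symbols are single characters, and the name `language` is hypothetical. Restricting to length at most n is exact for each of the operations, because a word of length at most n can only be built from words of length at most n.

```python
from itertools import product

# Assumed representation: ("empty",), ("sym", a), ("cat", e1, e2), ("alt", e1, e2),
# ("star", e), ("not", e) for complementation, ("and", e1, e2) for intersection.

def language(e, sigma, n):
    """The words of length at most n in L(e), for an extended regular expression e."""
    kind = e[0]
    if kind == "empty":
        return set()
    if kind == "sym":
        return {e[1]} if n >= 1 else set()
    if kind == "cat":
        left, right = language(e[1], sigma, n), language(e[2], sigma, n)
        return {u + v for u in left for v in right if len(u + v) <= n}
    if kind == "alt":
        return language(e[1], sigma, n) | language(e[2], sigma, n)
    if kind == "star":
        base = language(e[1], sigma, n)
        result, frontier = {""}, {""}
        while frontier:                   # add longer and longer products
            frontier = {u + v for u in frontier for v in base
                        if len(u + v) <= n} - result
            result |= frontier
        return result
    if kind == "not":
        universe = {"".join(w) for k in range(n + 1) for w in product(sigma, repeat=k)}
        return universe - language(e[1], sigma, n)
    if kind == "and":
        return language(e[1], sigma, n) & language(e[2], sigma, n)
    raise ValueError(f"unknown expression kind: {kind}")

# Words over {a, b} with no occurrence of a: the complement of (a+b)* a (a+b)*.
any_word = ("star", ("alt", ("sym", "a"), ("sym", "b")))
contains_a = ("cat", any_word, ("cat", ("sym", "a"), any_word))
print(sorted(language(("not", contains_a), "ab", 3)))
```

The evaluator enumerates words and is therefore exponential in n; it is meant as an executable reading of the definition, not as an efficient membership test.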


Extended regular expressions are equivalent to regular expressions because the set
of languages denoted by regular expressions is closed under complementation and
intersection (see Theorem 2.5.17).


There exists an algorithm which requires O((|w|+|e|)*) units of time to determine 
whether or not a word w of length |w| is in the language denoted by a regular 
expression e of length |e| [9, page 76]. 


2.6. Arden Rule 


Let us consider the equation r = sr + t among regular expressions in the unknown
r. We look for a solution of that equation, that is, a regular expression r̃ such that
L(r̃) = L(s) · L(r̃) ∪ L(t), where · denotes the concatenation of languages and it
is defined in Section 1.1.


We have the following theorem. 


THEOREM 2.6.1. Given the equation r = sr + t in the unknown r, its least
solution is s*t, that is, for any other solution z we have that L(s*t) ⊆ L(z). If
ε ∉ L(s) then s*t is the unique solution.


PROOF. We divide the proof in the following three Points (α), (β), and (γ).

Point (α). Let us first show that s*t is a solution for r of the equation r = sr + t,
that is, s*t = ss*t + t.

In order to show Point (α) now we show that: (α.1) L(s*t) ⊆ L(ss*t) ∪ L(t), and
(α.2) L(ss*t) ∪ L(t) ⊆ L(s*t).

Proof of (α.1). Since L(s*t) = ⋃_{i≥0} L(s^i t) we have to show that for each i ≥ 0,
L(s^i t) ⊆ L(ss*t) ∪ L(t), and this is immediate by induction on i because L(ss*t) =
⋃_{i>0} L(s^i t).

Proof of (α.2). Obvious because we have that L(ss*t) ⊆ L(s*t) and L(t) ⊆ L(s*t).

Point (β). Now we show that s*t is the minimal solution for r of r = sr + t.

We assume that z is a solution of r = sr + t, that is, z = sz + t. We have to
show that L(s*t) ⊆ L(z), that is, ⋃_{i≥0} L(s^i t) ⊆ L(z). The proof can be done by
induction on i ≥ 0.

(Basis: i = 0) L(t) ⊆ L(z) holds because z = sz + t.

(Step: i ≥ 0) We assume that L(s^i t) ⊆ L(z) and we have to show that L(s^{i+1} t) ⊆
L(z). This can be done as follows. From L(s^i t) ⊆ L(z) we get L(s^{i+1} t) ⊆ L(sz).
We also have that L(sz) ⊆ L(z) because z = sz + t and thus, by transitivity,
L(s^{i+1} t) ⊆ L(z).

Point (γ). Finally, we show that if ε ∉ L(s) then s*t is the unique solution for
r of r = sr + t. Let us assume that there is a different solution z. Since z is a
solution we have that z = sz + t. By Point (β) L(s*t) ⊆ L(z). Thus, we have that
L(z) = L(s*t) ∪ A, for some A such that: A ∩ L(s*t) = ∅ and A ≠ ∅.

Since z is a solution we have that L(s*t) ∪ A = L(s) (L(s*t) ∪ A) ∪ L(t). Now
L(s) (L(s*t) ∪ A) ∪ L(t) = L(ss*t) ∪ L(s)A ∪ L(t) = L(s*t) ∪ L(s)A. Thus, L(s*t) ∪ A =
L(s*t) ∪ L(s)A. From this equality we get: A ⊆ L(s)A because we have that
A ∩ L(s*t) = ∅. However, A ⊆ L(s)A is a contradiction as we now show. Indeed, let
us take the shortest word, say x, in A. If ε ∉ L(s) then the shortest word in L(s)A
is strictly longer than x, and thus x does not belong to L(s)A.


Analogously to Theorem 2.6.1, we also have that given the equation r = rs + t in the
unknown r, its least solution is ts*, and if ε ∉ L(s) then ts* is the unique solution
of the equation r = rs + t.
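
Theorem 2.6.1 can be checked experimentally on finite approximations of the languages involved. The sketch below works with the words of length at most n only, which is exact for concatenation, union, and star. The sets s and t are arbitrary examples chosen here, and the names `concat`, `star`, and `is_solution` are hypothetical. The second part shows why the condition ε ∉ L(s) is needed for uniqueness: when ε ∈ L(s), the set of all words is a solution too.

```python
from itertools import product

def concat(A, B, n):
    """Concatenation A · B restricted to the words of length at most n."""
    return {u + v for u in A for v in B if len(u + v) <= n}

def star(A, n):
    """Kleene star A* restricted to the words of length at most n."""
    result, frontier = {""}, {""}
    while frontier:
        frontier = concat(frontier, A, n) - result
        result |= frontier
    return result

def is_solution(R, s, t, n):
    """Check the equation R = s·R ∪ t on the words of length at most n."""
    return R == concat(s, R, n) | {w for w in t if len(w) <= n}

n = 6
t = {"b"}

# Case ε ∉ L(s): s*t is a solution (and, by the theorem, the only one).
s = {"a", "ab"}
least = concat(star(s, n), t, n)
print(is_solution(least, s, t, n))

# Case ε ∈ L(s): s*t is still a solution, but so is the set of all words.
s_eps = {"", "a"}
least_eps = concat(star(s_eps, n), t, n)
everything = {"".join(w) for k in range(n + 1) for w in product("ab", repeat=k)}
print(is_solution(least_eps, s_eps, t, n), is_solution(everything, s_eps, t, n))
```

The check is a finite experiment, not a proof: it only confirms the statement of the theorem up to the chosen length bound.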


2.7. Equations Between Regular Expressions 


Let us consider the alphabet Σ and the set RExpr_Σ of regular expressions over Σ.
An equation between regular expressions is an expression of the form x = y, where
the variables x and y range over the elements of RExpr_Σ.

Now we present an axiomatization of the equations between regular expressions
in the sense that any equation holding between two regular expressions can be derived
from the axioms and the derivation rules which we now introduce. The axioms
are equations between regular expressions, and the derivation rules are rules which
allow us to derive new equations from old equations.


The set of axioms is infinite and, in order to present all the axioms, we will write 
them as schematic axioms. A schematic axiom stands for all the axioms which can 
be derived by replacing each variable occurring in the schematic axiom by a regular 
expression in RExpr_Σ. Also the set of derivation rules is infinite and we will present
them as schematic derivation rules. 


Here is an axiomatization, call it F, of the equations between regular expressions
given by schematic axioms and schematic derivation rules. First, we list the following
schematic axioms A1-A11, where the variables x, y, and z are implicitly universally
quantified and range over regular expressions in RExpr_Σ.


A1.  x + (y + z) = (x + y) + z
A2.  x(yz) = (xy)z
A3.  x + y = y + x
A4.  x(y + z) = xy + xz
A5.  (y + z)x = yx + zx
A6.  x + x = x
A7.  ∅*x = x
A8.  ∅x = ∅
A9.  x + ∅ = x
A10. x* = ∅* + x*x
A11. x* = (∅* + x)*
The schematic derivation rules of F are the following ones:

R1. (Substitutivity) if y1 = y2 then x1 = x1[y1/y2]


where x1[y1/y2] denotes the expression x1 where every occurrence of y2 has been
replaced by y1.


R2. (Arden rule) if ε ∉ L(y) and x = xy + z then x = zy*.


As usual, the equality relation = is assumed to be reflexive, symmetric, and transi- 
tive. 


Given the axiomatization F of regular expressions, an equation x = y between
the regular expressions x and y is said to be derivable in F iff it can be derived as
the last equation of a sequence of equations each of which: (i) either is an instance
of an axiom, or (ii) can be derived by applying a derivation rule from previous
equations in the sequence.

An equation x = y is said to be valid iff L(x) = L(y). Thus, x = y is a valid 
equation iff the regular expressions x and y are equivalent, that is, L(x) = L(y) (see 
Definition 2.5.3 on page 45). 

An axiomatization is said to be consistent iff all equations derivable in that 
axiomatization are valid. 

An axiomatization is said to be complete iff all valid equations are derivable in 
that axiomatization. 


THEOREM 2.7.1. [Salomaa Theorem for Regular Expressions] The axiom- 
atization F is consistent and complete. 


One can show (see [17]) that no axiomatization of equations between regular 
expressions can be done in a purely equational form (like, for instance, the schematic 
axioms A1-A11), but one needs schematic axioms or derivation rules which are not
in an equational form (like, for instance, the schematic derivation rule R2). 


EXERCISE 2.7.2. Show that: x∅ = ∅.
Solution. x∅ = {by A8} = x(∅∅) = {by A9} = x∅∅ + ∅. From x∅ = (x∅)∅ + ∅ by
R2 we get: x∅ = ∅∅*. From x∅ = ∅∅* by A8 we get: x∅ = ∅.


Note that the round brackets used in the expression (x∅)∅ are only for reasons
of readability. Indeed, they are not necessary because the concatenation operation,
which we here denote by juxtaposition, is associative.


EXERCISE 2.7.3. Show that: x∅* = x.


Solution. x = {by A9} = x + ∅ = {by Exercise 2.7.2} = x + x∅ = {by A3} = x∅ + x.
From x = x∅ + x by R2 we get: x = x∅*.


Given two regular expressions e1 and e2, one can check whether or not e1 = e2
by: (i) constructing the corresponding minimal finite automata (see the following
Section 2.8), and then (ii) checking whether or not these minimal finite automata
are isomorphic according to the following definition.


DEFINITION 2.7.4. [Isomorphism Between Finite Automata] Two finite
automata are isomorphic iff they differ only by: (i) a bijective relabelling of the 
states, and (ii) the addition and the removal of the sink states and the edges from 
and to the sink states. 


2.8. Minimization of Finite Automata 


In this section we present the Myhill-Nerode Theorem which expresses an important 
property of the language accepted by any given finite automaton. We then present 
the Moore Theorem and two algorithms for the minimization of the number of states 
of the finite automata. Throughout this section we assume a fixed input alphabet Σ.


DEFINITION 2.8.1. [Refinement of a Relation] Given two equivalence relations
A and B, subsets of S × S, we say that A is a refinement of B or B is refined
by A iff for all x,y ∈ S we have that if x A y then x B y.


DEFINITION 2.8.2. [Right Invariant Equivalence Relation] An equivalence
relation R over the set Σ* is said to be right invariant iff x R y implies that for all
z ∈ Σ* we have that xz R yz.


DEFINITION 2.8.3. An equivalence relation R over a set S is said to be of finite
index iff the partition induced by R is made out of a finite number of equivalence
classes, also called blocks, that is, S = ⋃_{i∈I} S_i where: (i) I is a finite set, (ii) for
each i ∈ I, block S_i is a subset of S, and (iii) for each i,j ∈ I, if i ≠ j then
S_i ∩ S_j = ∅.

THEOREM 2.8.4. [Myhill-Nerode Theorem] Given an alphabet Σ, the following
three statements are equivalent, that is, (i) iff (ii) iff (iii).
(i) There exists a finite automaton A over the alphabet Σ which accepts the language
L ⊆ Σ*.
(ii) There exists an equivalence relation R_A over Σ* such that: (ii.1) R_A is right
invariant, (ii.2) R_A is of finite index, and (ii.3) the language L is the union of some
equivalence classes of R_A (as we will see from the proof, these equivalence classes
are associated with the final states of the automaton A).
(iii) Let us consider the equivalence relation R_L over Σ* defined as follows: for any
x and y in Σ*, x R_L y iff (for all z ∈ Σ*, xz ∈ L iff yz ∈ L). R_L is of finite index.
PROOF. We will prove that (i) implies (ii), (ii) implies (iii), and (iii) implies (i).
Proof of: (i) implies (ii). Let L be accepted by a deterministic finite automaton
with initial state q0 and total transition function δ. Let us consider the equivalence
relation R_A defined as follows:

for all x,y ∈ Σ*, x R_A y iff δ*(q0, x) = δ*(q0, y).

(ii.1) We show that R_A is right invariant. Indeed, let us assume that for x,y ∈ Σ*
we have that δ*(q0, x) = δ*(q0, y). Thus, for all z ∈ Σ*, we have:

δ*(q0, xz) = δ*(δ*(q0, x), z)    (by definition of δ*)
           = δ*(δ*(q0, y), z)    (by hypothesis)
           = δ*(q0, yz)          (by definition of δ*)

Now δ*(q0, xz) = δ*(q0, yz) implies that xz R_A yz.

(ii.2) We show that R_A is of finite index. Indeed, assume the contrary. Since
two different words x0 and x1 of Σ* are in the same equivalence class of R_A iff
δ*(q0, x0) = δ*(q0, x1), we have that if R_A has infinite index then there exists an
infinite sequence (x_i | i ≥ 0 and x_i ∈ Σ*) of words such that the elements of the
infinite sequence (δ*(q0, x0), δ*(q0, x1), δ*(q0, x2), ...) are all pairwise distinct. This
is impossible because for every i ≥ 0, δ*(q0, x_i) belongs to the set of states of the
finite automaton A and A has a finite set of states.

(ii.3) We show that L is the union of some equivalence classes of R_A. Indeed, assume
the contrary, that is, the language L is not the union of some equivalence classes
of R_A. Thus, there exist two words x and y in Σ* such that x R_A y and x ∈ L and
y ∉ L. By x R_A y we get δ*(q0, x) = δ*(q0, y). Since x ∈ L we have that δ*(q0, x)
is a final state of the automaton A while δ*(q0, y) is not a final state of A because
y ∉ L. This is a contradiction.


Proof of: (ii) implies (iii). We first show that R_A is a refinement of R_L. Indeed,

for all x,y ∈ Σ*, (x R_A y) implies that (for all z ∈ Σ*, xz R_A yz)

because R_A is right invariant. Since L is the union of some equivalence classes of
R_A we also have that:

for all x,y ∈ Σ*,
(for all z ∈ Σ*, xz R_A yz) implies that (for all z ∈ Σ*, xz ∈ L iff yz ∈ L).

Then, by definition of R_L, we have that:

for all x,y ∈ Σ*, (x R_L y) iff (for all z ∈ Σ*, xz ∈ L iff yz ∈ L).

Thus, we get that for all x,y ∈ Σ*, x R_A y implies that x R_L y, that is, R_A is a
refinement of R_L. Since R_A is of finite index also R_L is of finite index.
Proof of: (iii) implies (i). First we show that the equivalence relation R_L over Σ* is
right invariant. We have to show that

for every x,y ∈ Σ*, x R_L y implies that for all z ∈ Σ*, xz R_L yz.

Thus, by definition of R_L, we have to show that

for every x,y ∈ Σ*, x R_L y implies that for all z,w ∈ Σ*, xzw ∈ L iff yzw ∈ L.

This is true because

for all z,w ∈ Σ*, xzw ∈ L iff yzw ∈ L

is equivalent to

for all z ∈ Σ*, xz ∈ L iff yz ∈ L

and, by definition of R_L, this last formula is equivalent to x R_L y.


Now, starting from the given relation R_L, we will define a finite automaton
(Q, Σ, q0, F, δ) and we will show that it accepts the language L. In what follows for
every w ∈ Σ* we denote by [w] the equivalence class of R_L to which the word w
belongs.

Let Q be the set of the equivalence classes of R_L. Since R_L is of finite index, the
set Q is finite. Let the initial state q0 be the equivalence class [ε] and the set F of
final states be {[w] | w ∈ L}.

For every w ∈ Σ* and for every v ∈ Σ, we define δ([w], v) to be the equivalence
class [wv]. This definition of the transition function δ is well-formed because for
every word w1, w2 ∈ [w] we have that:

for every v ∈ Σ, δ([w1], v) = δ([w2], v), that is, for every v ∈ Σ, [w1 v] = [w2 v].

This can be shown as follows. Since R_L is right invariant we have that:

∀w1, w2 ∈ Σ*, if w1 R_L w2 then (∀v ∈ Σ, w1 v R_L w2 v)

that is,

∀w1, w2 ∈ Σ*, if [w1] = [w2] then (∀v ∈ Σ, [w1 v] = [w2 v]).

Now the finite automaton M = (Q, Σ, q0, F, δ) accepts the language L. Indeed, take
a word w ∈ Σ*. We have that δ*(q0, w) ∈ F iff δ*([ε], w) ∈ F iff [εw] ∈ F iff [w] ∈ F
iff w ∈ L.


Note that the equivalence relation R_A has at most as many equivalence classes
as the states of the given finite automaton A. The fact that R_A may have fewer
equivalence classes is shown by the finite automaton M over the alphabet Σ = {a,b}
depicted in Figure 2.8.1. The automaton M has two states, while R_M has one
equivalence class only which is the whole set Σ*, that is, for every x,y ∈ Σ*, we have
that x R_M y.


FIGURE 2.8.1. A deterministic finite automaton M with two states
over the alphabet Σ = {a,b}. The equivalence relation R_M is Σ* × Σ*.


Theorem 2.8.4 is actually due to Nerode. Myhill in [12] proved the following Theo- 
rem 2.8.7. 


DEFINITION 2.8.5. [Congruence over Σ*] A binary equivalence relation R over
Σ* is said to be a congruence iff for all x, y, z1, z2 ∈ Σ*, if x R y then z1 x z2 R z1 y z2.


DEFINITION 2.8.6. A language L ⊆ Σ* induces a congruence C_L over Σ* defined
as follows: ∀x,y ∈ Σ*, x C_L y iff (∀z1, z2 ∈ Σ*, z1 x z2 ∈ L iff z1 y z2 ∈ L).


THEOREM 2.8.7. [Myhill Theorem] L ⊆ Σ* is a regular language iff L is the
union of some equivalence classes of a congruence relation of finite index over Σ* iff
the congruence C_L induced by L is of finite index.


The following theorem allows us to check whether or not two given finite au- 
tomata are equivalent, that is, they accept the same language. 


THEOREM 2.8.8. [Moore Theorem] There exists an algorithm that given 
any two finite automata, always terminates and tells us whether or not they are 
equivalent. 
We will see this algorithm in action in the following Example 2.8.9 and
Example 2.8.10.


EXAMPLE 2.8.9. Let us consider the two finite automata F1 and F2 of
Figure 2.8.2. In order to test whether or not the two automata are equivalent, we
construct a table which represents what can be called ‘the synchronized
superposition’ of the transition functions of the two automata as we now explain.


FIGURE 2.8.2. The two deterministic finite automata F1 and F2 of Example 2.8.9.


The rows and the entries of the table are labeled by pairs of states of the form
(S1, S2). The first projection S1 of each pair is a state of the first automaton F1 and
the second projection S2 is a state of the second automaton F2. The columns of the
table are labeled by the input values 0 and 1.

Starting from the pair (A, M) of the initial states there is a transition to the pair
(A, M) for the input 0 and to the pair (B, N) for the input 1. Thus, we get the first
row of the table (see below). Since we got the new pair (B, N) we initialize a new
row with label (B, N). For the input 0 there is a transition to the pair (C, P) and
for the input 1 there is a transition to the pair (A, M). We continue the construction
of the table by adding the new row with label (C, P).

The construction continues until we get a table where every entry is a label of 
a row already present in the table. At that point the construction of the table 
terminates. In our case we get the following final table. 


Synchronized transition function of the two finite automata F1 and F2 of
Figure 2.8.2:
[table not reproduced: one row per pair of states, one column for each of the inputs 0 and 1]


Now in this table each pair of states is made out of states which are both final or
both non-final. Exactly in this case the two automata are equivalent.
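
The construction of the synchronized table can be sketched as a search over the pairs of states reachable from the pair of initial states. The sketch assumes that both automata are deterministic with total transition functions, represented as dictionaries from (state, symbol) pairs to states. Since the automata of Figure 2.8.2 are not reproduced here, the two automata below are hypothetical examples (both accept the words with an even number of 1's), and the name `equivalent` is hypothetical too.

```python
from collections import deque

def equivalent(dfa1, dfa2, alphabet):
    """Each automaton is a triple (delta, initial, finals) with delta total."""
    (d1, q1, f1), (d2, q2, f2) = dfa1, dfa2
    seen, queue = {(q1, q2)}, deque([(q1, q2)])   # rows of the table
    while queue:
        p, q = queue.popleft()
        if (p in f1) != (q in f2):                # one final, the other not
            return False
        for a in alphabet:                        # entries of the row (p, q)
            nxt = (d1[p, a], d2[q, a])
            if nxt not in seen:                   # a new row is needed
                seen.add(nxt)
                queue.append(nxt)
    return True

# Hypothetical automata over {0, 1}.
G1 = ({("A", "0"): "A", ("A", "1"): "B", ("B", "0"): "B", ("B", "1"): "A"}, "A", {"A"})
G2 = ({("M", "0"): "P", ("M", "1"): "N", ("N", "0"): "N", ("N", "1"): "M",
       ("P", "0"): "M", ("P", "1"): "N"}, "M", {"M", "P"})
print(equivalent(G1, G2, "01"))
```

The search stops as soon as a pair with exactly one final state is found, and it always terminates because there are finitely many pairs of states.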
EXAMPLE 2.8.10. Let us consider the two finite automata F1 and F3 of
Figure 2.8.3. We construct a table which represents the synchronized superposition
of the transition functions of the two automata as we have done in Example 2.8.9
above.


FIGURE 2.8.3. The two deterministic finite automata F1 and F3 of Example 2.8.10.


Starting from the pair (A, M) of the initial states there is a transition to the
pair (A, M) for the input 0 and to the pair (B, N) for the input 1. From the pair
(B, N) there is a transition to the pair (C, Q) for the input 0 and the pair (A, P)
for the input 1. At this point it is not necessary to continue the construction of the
table, because we have found a pair of states, namely (A, P), such that A is a final
state and P is not final. We may conclude that the two automata of Figure 2.8.3
are not equivalent.


As a corollary of Moore Theorem (see Theorem 2.8.8 above) we have an enumer- 
ation method for finding a finite automaton which has the minimal number of states 
among all automata which are equivalent to the given one. Indeed, given a finite 
automaton with k states with k > 0, it is enough to generate all finite automata 
with a number of states less than k and test for each of them whether or not it is 
equivalent to the given one. However, this ‘generate and test’ algorithm has a very
high time complexity because there is an exponential number of connected graphs with
n nodes. This implies that there is also an exponential number of finite automata
with n nodes over a fixed finite alphabet.


Fortunately, there is a much faster algorithm which given a finite automaton, 
constructs an equivalent finite automaton with minimal number of states. We will 
see this algorithm in action in the following example. 


EXAMPLE 2.8.11. Let us consider the finite automaton of Figure 2.8.4. 
We construct the following table which has all pairs of states of the automaton to 
be minimized. In this table in every column all pairs have the same first component 
and in every row all pairs have the same second component. 
FIGURE 2.8.4. A deterministic finite automaton to be minimized.


AB
AC  BC
AD  BD  CD
AE  BE  CE  DE


Then we cross out the pair XY of states iff state X is not equivalent to state Y.
Now, recalling Definition 2.1.3 on page 30, we have that a state X is not equivalent
to state Y iff there exists an element v of the alphabet Σ such that δ(X,v) is not
equivalent to δ(Y,v).

In particular, A is not equivalent to B because δ(A,a) = B, δ(B,a) = C, and B
is not equivalent to C (indeed, B is not a final state while C is a final state). Thus,
we cross out the pair AB and we write: AB×, instead of AB. Analogously, A is
not equivalent to C because δ(A,b) = E, δ(C,b) = D, and E is not equivalent to D
(indeed, E is not a final state while D is a final state). We cross out the pair AC as
well. We get the following table:


AB×
AC×  BC
AD   BD   CD
AE   BE   CE   DE


At the end of this procedure, we get the table: 


AB×
AC×  BC×
AD×  BD×  CD✓
AE×  BE×  CE×  DE×


We did not cross out the pair CD because the states C and D are equivalent (see the
checkmark ✓). Indeed, δ(C,a) = C, δ(D,a) = C, δ(C,b) = D, and δ(D,b) = D.
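
The crossing-out procedure can be sketched as follows. The automaton of Figure 2.8.4 is not reproduced here, so the transition table below is partly hypothetical: the transitions δ(A,a), δ(A,b), δ(B,a), δ(C,a), δ(C,b), δ(D,a), and δ(D,b) are those mentioned in the text, while δ(B,b), δ(E,a), δ(E,b), and the choice of {C, D} as the set of final states are assumptions made only for this illustration. The name `inequivalent_pairs` is hypothetical.

```python
from itertools import combinations

def inequivalent_pairs(states, alphabet, delta, finals):
    """Return the set of crossed-out pairs (as frozensets of two states)."""
    # First cross out the pairs made of a final and a non-final state.
    crossed = {frozenset(p) for p in combinations(states, 2)
               if (p[0] in finals) != (p[1] in finals)}
    changed = True
    while changed:                        # repeat until no new pair is crossed out
        changed = False
        for x, y in combinations(states, 2):
            pair = frozenset((x, y))
            if pair in crossed:
                continue
            # Cross out XY if some symbol leads to an already crossed-out pair.
            if any(frozenset((delta[x, v], delta[y, v])) in crossed for v in alphabet):
                crossed.add(pair)
                changed = True
    return crossed

states = "ABCDE"
delta = {("A", "a"): "B", ("A", "b"): "E",
         ("B", "a"): "C", ("B", "b"): "E",    # δ(B,b) is assumed
         ("C", "a"): "C", ("C", "b"): "D",
         ("D", "a"): "C", ("D", "b"): "D",
         ("E", "a"): "E", ("E", "b"): "E"}    # both transitions from E are assumed
crossed = inequivalent_pairs(states, "ab", delta, finals={"C", "D"})
kept = {frozenset(p) for p in combinations(states, 2)} - crossed
print([sorted(p) for p in kept])
```

With these assumptions the only pair that is not crossed out is CD, in agreement with the final table above.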

Given a finite automaton we get the equivalent minimal finite automaton by 
repeatedly applying the following replacements: 
(i) any two equivalent states, say X and Y, are replaced by a unique state, say Z,
and 


(ii) the edges are replaced as follows: (ii.1) every labeled edge from a state A to the 
state X or Y is replaced by an edge from the state A to the state Z with the same 
label, and (ii.2) every labeled edge from the state X or Y to a state A is replaced 
by an edge from the state Z to the state A with the same label. 

Thus, in our case we get the minimal finite automaton depicted in Figure 2.8.5. 


FIGURE 2.8.5. The minimal finite automaton which corresponds to 
the finite automaton of Figure 2.8.4. 


Note that the minimal finite automaton corresponding to a given one is unique 
up to isomorphism, that is, up to: (i) the renaming of states, and (ii) the addition 
and the removal of sink states and edges to sink states. 


Now we present a second algorithm which given a finite automaton, constructs
an equivalent finite automaton with the minimal number of states. We will see this
algorithm in action in the following Example 2.8.13. First, we need the following
definition in which for any given finite automaton (Q, Σ, q0, F, δ), we introduce the
binary relation ~_i, for every i ≥ 0. Each of the ~_i's is a subset of Q × Q.


DEFINITION 2.8.12. [Equivalence ~ Between States] Given a finite automaton
(Q, Σ, q0, F, δ), for any p, q ∈ Q we define:
(i) p ~_0 q iff p and q are both final states or they are both non-final states,
and
(ii) for every i ≥ 0, p ~_{i+1} q iff p ~_i q and ∀v ∈ Σ, we have that:
(ii.1) ∀p' ∈ Q, if δ(p,v) = p' then ∃q' ∈ Q, (δ(q,v) = q' and p' ~_i q'),
and
(ii.2) ∀q' ∈ Q, if δ(q,v) = q' then ∃p' ∈ Q, (δ(p,v) = p' and p' ~_i q').
We have that: p ~ q iff for every i ≥ 0 we have that p ~_i q. Thus, we have that:
~ = ⋂_{i≥0} ~_i.
We say that the states p and q are equivalent iff p ~ q.


It is easy to show that for every i ≥ 0, the binary relation ~_i is an equivalence
relation. Also the binary relation ~ is an equivalence relation.
We have that for every i ≥ 0, ~_{i+1} is a refinement of ~_i, that is, for all p, q ∈ Q,
p ~_{i+1} q implies that p ~_i q.
Moreover, if for some k ≥ 0 it is the case that ~_{k+1} = ~_k then ~ = ⋂_{0≤i≤k} ~_i.


One can show that the notion of state equivalence we have now introduced is 
equal to that of Definition 2.1.3 on page 30. 


As a consequence of the Myhill-Nerode Theorem, the relation ~ partitions the set Q
of states into the minimal number of blocks such that for any two states p and q in
the same block and for all v ∈ Σ, δ(p,v) ~ δ(q,v). Thus, we may minimize a given
automaton by constructing the relation ~.
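The construction of the relation ~ as the limit of the relations ~_0, ~_1, ... can be sketched in Java as follows. The sketch assumes a deterministic automaton with a total transition function, so that conditions (ii.1) and (ii.2) both reduce to δ(p,v) ~_i δ(q,v). The automaton is hypothetical and the names PartitionRefinement and refine are ours.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PartitionRefinement {

    // Hypothetical automaton (an assumption of this sketch):
    // 5 states, 2 input symbols, final states 2 and 3.
    static final int[][] DELTA = { {1, 4}, {2, 4}, {2, 3}, {2, 3}, {4, 4} };
    static final boolean[] FINAL = {false, false, true, true, false};

    // Returns an array block such that block[p] == block[q] iff p ~ q.
    static int[] refine(int[][] delta, boolean[] isFinal) {
        int n = delta.length;
        int[] block = new int[n];
        for (int q = 0; q < n; q++) {
            block[q] = isFinal[q] ? 1 : 0;          // the relation ~_0
        }
        int count = (int) Arrays.stream(block).distinct().count();
        while (true) {
            // Two states stay together iff they are in the same block and,
            // for every symbol, their successors are in the same block.
            Map<List<Integer>, Integer> ids = new HashMap<>();
            int[] next = new int[n];
            for (int q = 0; q < n; q++) {
                List<Integer> row = new ArrayList<>();
                row.add(block[q]);
                for (int target : delta[q]) {
                    row.add(block[target]);
                }
                Integer id = ids.get(row);
                if (id == null) {
                    id = ids.size();
                    ids.put(row, id);
                }
                next[q] = id;
            }
            // ~_{i+1} refines ~_i: if the number of blocks has not grown,
            // then ~_{i+1} = ~_i and the limit has been reached.
            if (ids.size() == count) {
                return next;
            }
            count = ids.size();
            block = next;
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(refine(DELTA, FINAL)));
    }
}
```

Each iteration of the while-loop corresponds to the construction of one new table in the example that follows: the list called row plays the role of a row of the table, computed in terms of blocks.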


The following example shows how to compute the relation ~ in practice.


EXAMPLE 2.8.13. Let us consider the finite automaton R whose transition
function is given in the following Table T (see page 66). The input alphabet of R is
Σ = {a,b,c} and the set of states of R is {1, 2, 3, 4, 5, 6, 7, 8}.

We want to minimize the number of states of the automaton R. The initial state 
of R is state 1 and the final states of R are states 2, 4, and 6. According to our 
conventions, in Table T and in the following tables we have underlined the final 
states (recall Notation 2.3.10 introduced on page 38). 

The finite automaton R has been depicted in Figure 2.8.6 on page 67. 


In order to compute the minimal finite automaton which is equivalent to R we
proceed by constructing a sequence of tables T0, T1, ..., as we now indicate. For
i ≥ 0, Table Ti denotes a partition of the set of states of the automaton R which
corresponds to the equivalence relation ~_i.


Table T which shows 
the transition function 
of the automaton R: 


[Table T: one row for each state 1, ..., 8 and one column for each input symbol a, b, c; entries not reproduced.]


Initially, in order to construct the Table T0 we partition Table T into two blocks:
(i) the block A which includes the non-final states 1, 3, 5, 7, and 8, and
(ii) the block B which includes the final states 2, 4, and 6.


Then the transition function is computed in terms of the blocks A and B, in the
sense that, for instance, δ(1,a) = B because in Table T we have that δ(1,a) = 2
and state 2 belongs to block B.

FIGURE 2.8.6. The finite automaton R whose transition function is 
shown in Table T on page 66. State 1 is the initial state and states 2, 
4, and 6 are the final states. 


Thus, we get the following Table T0 where the initial state is block A because the
initial state of the given automaton R is state 1 and state 1 belongs to block A. The
final state is block B which includes all the final states of the given automaton R.


Table T0:
                  a   b   c
     A     1      B   B   A
           3      B   B   A
           5      B   B   A
           7      B   B   A
           8      B   B   A
     B     2      A   B   B
           4      A   B   B
           6      A   A   B


This Table T0 represents the equivalence relation ~_0 because any two states which
belong to the same block are either both final or both non-final. The blocks of the
equivalence ~_0 are: {1,3,5,7,8} and {2,4,6}.

Now, the states within block A are all pairwise equivalent because their entries
in Table T0 are all the same, namely [B B A], while the states within block B are
not all pairwise equivalent because, for instance, δ(4,b) = B and δ(6,b) = A.

Whenever a block contains two states which are not equivalent, we proceed by
constructing a new table which corresponds to a new equivalence relation which
is a refinement of the equivalence relation corresponding to the last table we have
constructed. Thus, in our case, we partition the block B into the two blocks: (i) B1
which includes the states 2 and 4 which have the same row [A B B], and (ii) B2
which includes the state 6 with row [A A B]. We get the following new table:


Table T1 : 


Then the transition function is computed in terms of the blocks A, B1, and B2,
in the sense that, for instance, δ(1,a) = B1 because in Table T we have that
δ(1,a) = 2 and state 2 belongs to block B1 (see Table T1). Table T1 corresponds
to the equivalence relation ~_1. The blocks of the equivalence ~_1 are: {1,3,5,7,8},
{2,4}, and {6}.

Now states 1, 3, and 8 are not equivalent to states 5 and 7 because, for instance,
δ(1,a) = δ(3,a) = δ(8,a) = B1 while δ(5,a) = δ(7,a) = B2. Thus, we partition
block A into two blocks: (i) A1 which includes the states 1, 3, and 8 which have the
same row [B1 B1 A], and (ii) A2 which includes the states 5 and 7 which have the
same row [B2 B1 A]. We get the following new table:


Table T2 :   [Table T2: blocks A1, A2, B1, and B2; entries not reproduced.]


Then the transition function is computed in terms of the blocks A1, A2, B1, and
B2, in the sense that, for instance, δ(1,c) = A2 because in Table T we have that
δ(1,c) = 5 and state 5 belongs to block A2 (see Table T2). Table T2 corresponds to
the equivalence relation ~_2. The blocks of the equivalence ~_2 are: {1,3,8}, {5,7},
{2,4}, and {6}.

Now all rows within each block A1, A2, B1, and B2 of Table T2 are the same.
Thus, we get ~_3 = ~_2 and ~ = ~_2. Therefore, the minimal finite automaton
equivalent to the automaton R has a transition function corresponding to Table T2. This
minimal automaton is depicted in Figure 2.8.7 below.


FIGURE 2.8.7. The minimal finite automaton corresponding to the 
finite automaton of Table T and Figure 2.8.6. 


EXERCISE 2.8.14. We show that the equation (0 + 01 + 10)* = (10 + 0*01)*0* 
between regular expressions holds by: 
(i) constructing the minimal finite automata corresponding to the regular expres- 
sions, and then 
(ii) checking that these two minimal finite automata are isomorphic (see Defini- 
tion 2.7.4 on page 59). 


For the regular expression (0 + 01 + 10)* we get the following transition graph:

[Figure: transition graph for the regular expression (0 + 01 + 10)*.]

By applying the Powerset Construction Procedure we get the finite automaton:

[Figure: finite automaton obtained by the Powerset Construction Procedure.]

whose transition function is given by the following table where we have underlined
the final states:


Now the states ACE, BCE, and CE (marked with (*)) are equivalent and we get 
the following minimal finite automaton M1. 


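Finite automaton M1:

[Figure: the minimal finite automaton M1, in which the equivalent states ACE, BCE, and CE are merged into the single state ACE/BCE/CE.]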


For the regular expression (10 + 0*01)*0* we get a transition graph and then, by
applying the Powerset Construction Procedure as above, a finite automaton
whose transition function is given by the following table where we have underlined
the final states:


DEFGH | DEFGH BCDEFG 
(x) BCDEFG | DEFGH 


Now the states ABCDEFG and BCDEFG (marked with (*)) are equivalent and
we get the following minimal finite automaton M2.


Finite automaton M2:

[Figure: the minimal finite automaton M2.]


We have that the finite automaton M1 is isomorphic to the finite automaton M2.


We leave it as an exercise to the reader to prove that (0+01+10)* = 0*(01+10*0)* by 
finding the two minimal finite automata corresponding to the two regular expressions 
and then showing their isomorphism, as we have done in Exercise 2.8.14 above. 


2.9. Pumping Lemma for Regular Languages 


The following theorem provides a necessary condition which holds for every
language generated by a regular grammar.


THEOREM 2.9.1. [Pumping Lemma for Regular Languages] For every regular
grammar G there exists a number p > 0, called a pumping length of the grammar G,
depending on G only, such that for all w ∈ L(G), if |w| > p then there exist
the words x, y, z such that:
(i) w = x y z,
(ii) y ≠ ε, and
(iii) for all i ≥ 0, x y^i z ∈ L(G).


The minimum value of the pumping length p is said to be the minimum pumping 
length of the grammar G. 


PROOF. Let p be the number of the productions of the grammar G. If we apply i
productions of the grammar G from S we generate a word of length i. If we choose
a word, say w, of length q = p+1, then every derivation of that word must have a
production which is applied at least twice. That production cannot be of the form
A → a because when we apply a production of the form A → a then the derivation
stops. Thus, the production which during the derivation is applied at least twice, is
of the form A → a B.


Case (1). Let us consider the case in which A is S. In this case the derivation of w
is of the form

S →* y S →* y z = w.

Thus, if we perform i times, for any i ≥ 0, the derivation S →* y S we get the
derivation S →* y^i z. The word y^i z is equal to x y^i z for x = ε, and for any i ≥ 0,
x y^i z ∈ L(G).

Case (2). Let us consider the case in which A is different from S. In this case the
derivation of w is of the form

S →* x A →* x y A →* x y z = w.

Thus, if we perform i times, for any i ≥ 0, the derivation from A to y A we get
the derivation S →* x A →* x y^i z and thus, for any i ≥ 0, we have that
x y^i z ∈ L(G).
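The decomposition w = x y z used in the proof can be observed on a small example. The following Java sketch rephrases the argument in terms of a deterministic finite automaton rather than productions (this change of view is an assumption of the sketch): a state which is visited twice plays the role of the nonterminal A of the production which is applied twice. The automaton, over the alphabet {a,b}, is chosen for illustration only, and the names PumpingDemo, split, and pump are ours.

```java
import java.util.Arrays;

class PumpingDemo {

    // Illustrative automaton: states 0..3, columns for 'a' and 'b',
    // initial state 0, final states 1 and 2, sink state 3.
    static final int[][] DELTA = { {1, 3}, {2, 3}, {2, 1}, {3, 3} };

    static boolean isFinal(int state) {
        return state == 1 || state == 2;
    }

    static boolean accepts(String w) {
        int state = 0;
        for (char c : w.toCharArray()) {
            state = DELTA[state][c - 'a'];
        }
        return isFinal(state);
    }

    // Returns {x, y, z} with w = x y z and y nonempty, where y is read
    // between the first two visits of a repeated state.
    // Returns null if no state is visited twice.
    static String[] split(String w) {
        int[] firstSeen = new int[DELTA.length];
        Arrays.fill(firstSeen, -1);
        int state = 0;
        firstSeen[state] = 0;
        for (int k = 0; k < w.length(); k++) {
            state = DELTA[state][w.charAt(k) - 'a'];
            if (firstSeen[state] >= 0) {
                int j = firstSeen[state];
                return new String[] {
                    w.substring(0, j), w.substring(j, k + 1), w.substring(k + 1)
                };
            }
            firstSeen[state] = k + 1;
        }
        return null;
    }

    // Builds the word x y^i z.
    static String pump(String[] xyz, int i) {
        StringBuilder sb = new StringBuilder(xyz[0]);
        for (int n = 0; n < i; n++) {
            sb.append(xyz[1]);
        }
        return sb.append(xyz[2]).toString();
    }

    public static void main(String[] args) {
        String[] xyz = split("aaba");
        for (int i = 0; i <= 3; i++) {
            String w = pump(xyz, i);
            System.out.println(w + " accepted: " + accepts(w));
        }
    }
}
```

For the word aaba the sketch finds x = a, y = ab, and z = a, and every word a (ab)^i a is accepted.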


COROLLARY 2.9.2. The language L = {a^i b c^i | i ≥ 1} is not a regular language.
Thus, the language {a^i b c^i | i ≥ 0} cannot be generated by an S-extended
regular grammar.


PROOF. Suppose that L is a regular language and let G be a regular grammar
which generates L. By the Pumping Lemma 2.9.1 there exist the words x, y ≠ ε,
and z such that for a sufficiently large i, we have that w = a^i b c^i = x y z ∈ L and
also for any j ≥ 0, we have that x y^j z ∈ L. Now,

(i) if y does not include b then there is a word in L with the number of a’s different
from the number of c’s,

(ii) if y = b then the word a^i b^2 c^i should belong to L, and

(iii) if y includes b and it is different from b then also a word with two non-adjacent
b’s is in L.


In all cases (i), (ii), and (iii) we get a contradiction. 


We have the following fact. 


FACT 2.9.3. [The Pumping Lemma for Regular Languages is not a Sufficient
Condition] The Pumping Lemma for regular languages is a necessary, but
not a sufficient condition for a language to be regular. Thus, there are languages
which satisfy this Pumping Lemma and are not regular.


PROOF. Let us consider the alphabet Σ = {0,1} and the following language L ⊆ Σ*:

L =def {u u^R v | u, v ∈ Σ+}

where u^R denotes the reversal of the word u, that is, the word derived from u by
taking its symbols in the reverse order (see also Definition 2.12.3 on page 95).

Now we show that L satisfies the Pumping Lemma for regular languages. 

Let us consider the pumping length p = 4 and a word w =def u u^R v with
u, v ∈ Σ+ such that |w| > 4. We have that w ∈ L. There are two cases: (α) |u| = 1,
and (β) |u| > 1.

In Case (α) we have that |v| > 2 and we take the subword y of the Pumping
Lemma to be the leftmost character of the word v.

For instance, if u = 0 and v = 10 we have that u u^R v = 0010 and, for all i ≥ 0,
the word 0 0 1^i 0 belongs to L (indeed, for all i ≥ 0, the leftmost part of 0 0 1^i 0 is a
palindrome).


In Case (β) we take the subword y of the Pumping Lemma to be the leftmost
character of the word u.

For instance, if u = 01 and v = 1 we have: u u^R v = 01101 and, for all i ≥ 0,
the word 0^i 1101 belongs to L because, for all i > 0, the leftmost part of 0^i 1101
is a palindrome. Note that also for i = 0, the leftmost part of the word 0^i 1101 is
a palindrome.

Indeed, it is the case that, for any word (a s) ∈ Σ+, with a ∈ Σ and s ∈ Σ+, we
have that: (a s) (a s)^R = a s s^R a and, thus, if we take the leftmost character away,
we get the word s s^R a whose leftmost part s s^R is a palindrome.

This concludes the proof that the language L satisfies the Pumping Lemma for
regular languages.

It remains to show that L is not a regular language. This is an obvious
consequence of the fact that, as we will see on page 153, the language {u u^R | u ∈ Σ+} is
not regular. □
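The first part of the proof can be checked by brute force on short words. The following Java sketch tests that every word w ∈ L with 4 < |w| ≤ 10 can be written as x y z with y ≠ ε in such a way that x y^i z ∈ L for i = 0, ..., 4. It uses the fact that w belongs to L iff some proper prefix of w of even length 2k, with k ≥ 1, is a palindrome. The bounds 10 and 4 are arbitrary choices and the names of the class and of the methods are ours.

```java
class PalindromePrefixLanguage {

    // True iff the first len characters of w form a palindrome.
    static boolean isPalindrome(String w, int len) {
        for (int i = 0; i < len / 2; i++) {
            if (w.charAt(i) != w.charAt(len - 1 - i)) return false;
        }
        return true;
    }

    // Membership in L = { u u^R v | u, v nonempty }.
    static boolean inL(String w) {
        for (int k = 1; 2 * k < w.length(); k++) {
            if (isPalindrome(w, 2 * k)) return true;
        }
        return false;
    }

    // True iff w = x y z for some nonempty y with x y^i z in L
    // for every i = 0, ..., maxI.
    static boolean pumpable(String w, int maxI) {
        for (int start = 0; start < w.length(); start++) {
            for (int end = start + 1; end <= w.length(); end++) {
                String y = w.substring(start, end);
                String z = w.substring(end);
                StringBuilder prefix = new StringBuilder(w.substring(0, start));
                boolean ok = true;
                for (int i = 0; i <= maxI && ok; i++) {
                    ok = inL(prefix + z);   // prefix is x y^i at this point
                    prefix.append(y);
                }
                if (ok) return true;
            }
        }
        return false;
    }

    // Checks all the words over {0,1} whose length is in [minLen, maxLen].
    static boolean allPumpable(int minLen, int maxLen, int maxI) {
        for (int len = minLen; len <= maxLen; len++) {
            for (int bits = 0; bits < (1 << len); bits++) {
                StringBuilder w = new StringBuilder();
                for (int i = 0; i < len; i++) {
                    w.append((bits >> i) & 1);
                }
                String word = w.toString();
                if (inL(word) && !pumpable(word, maxI)) return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allPumpable(5, 10, 4));
    }
}
```

Such a test is, of course, not a proof: it only confirms the statement for the finitely many words and values of i which are considered.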


Note that there is a statement which provides a necessary and a sufficient condi- 
tion for a language to be regular: it is the Myhill-Nerode Theorem (see Theorem 2.8.7 
on page 61). 


2.10. A Parser for Regular Languages 


Given a regular grammar G we can construct a parser for the language L(G) by 
performing the following three steps. 


Step (1). Construct a finite automaton A corresponding to the grammar G (see 
Section 2.2 starting on page 33). 


Step (2). Construct a deterministic finite automaton D as follows: if A is deterministic
then D is A, otherwise D is the deterministic finite automaton equivalent to A (it
can be constructed from A by using the Powerset Construction of Algorithm 2.3.11
on page 39).

Step (3). Construct the parser by using the transition function δ of the automaton D
as we now indicate.


Let stp be the string to parse. We want to check whether or not stp ∈ L(G). We
start from the initial state of D and, by taking one symbol of stp at a time, from left
to right, we make a transition from the current state to a new state according to the
transition function δ of the automaton D, until the string stp is finished. Let q be
the state reached when the transition corresponding to the rightmost symbol of stp
has been considered. If q is a final state then stp ∈ L(G), otherwise stp ∉ L(G).

In order to improve the efficiency of the parser, instead of the transition
function δ of the automaton D, we can consider the transition function of the minimal
automaton corresponding to D (see Section 2.8 starting on page 59).


Now we present a Java program which realizes a parser for the language generated
by a given regular grammar G, by using the finite automaton which is equivalent
to G, that is, the finite automaton which accepts the language L(G).


Let us consider the grammar G with axiom S and the following productions:

S → a A | a

A → a A | a | a B

B → b A | b

The minimal finite automaton which accepts the language generated by the
grammar G, is depicted in Figure 2.10.1 on page 75. By using Kleene Theorem (see
Theorem 2.5.10 on page 47), one can show that this grammar generates the regular
language denoted by the regular expression a(a+ab)*.


FIGURE 2.10.1. The minimal finite automaton which accepts the regular
language generated by the grammar with axiom S and productions:
S → a A | a,  A → a A | a | a B,  B → b A | b.

States 0, 1, and 2 correspond to the nonterminals S, A, and B, re- 
spectively. State 3 is a sink state. 


In our Java program we assume that: 

(i) the terminal alphabet is {a,b}, 
(ii) the states of the automaton are denoted by the integers 0, 1, 2, and 3, 
(iii) the initial state is 0, 
(iv) the set of final states is {1,2}, and 
(v) the transition function δ is defined as follows:

δ(0,a) = 1;  δ(0,b) = 3;  δ(1,a) = 2;  δ(1,b) = 3;

δ(2,a) = 2;  δ(2,b) = 1;  δ(3,a) = 3;  δ(3,b) = 3.
States 0, 1, and 2 correspond to the nonterminals S, A, and B, respectively. State 3
is a sink state.

/**
 * =======================================================================
 * PARSER FOR A REGULAR GRAMMAR USING A FINITE AUTOMATON
 * =======================================================================
 *
 * The terminal alphabet is {a,b}.
 * The string to parse belongs to {a,b}*. It may also be the empty string.
 *
 * Every state of the automaton is denoted by an integer.
 * The transition function is denoted by a matrix with two columns, one
 * for 'a' and one for 'b', and as many rows as the number of states of
 * the automaton.
 * =======================================================================
 */

public class FiniteAutomatonParser {

  // stp is the string to parse. It belongs to {a,b}*.
  private static String stp = "aaba";

  // lstp1 is (length - 1) of the string to parse. It is only used
  // in the for-loop below.
  private static int lstp1 = stp.length()-1;

  // The initial state is 0.
  private static int state = 0;

  // The final states are 1 and 2.
  private static boolean isFinal (int state) {
    return (state == 1 || state == 2);
  }

  // The transition function is denoted by the following 4x2 matrix.
  // We have 4 states. The sink state is state 3.
  private static int [][] transitionFunction = {
    {1,3},   // row 0 for state 0
    {2,3},   // row 1 for state 1
    {2,1},   // row 2 for state 2
    {3,3},   // row 3 for state 3
  };

  // ---------------------------------------------------------------------
  public static void main (String [] args) {

    // In the for-loop below ps is the pointer to a character of
    // the string to parse stp. We have that: 0 <= ps <= lstp1.
    int ps;

    // 'a' is at column 0 and 'b' is at column 1.
    // Indeed, 'a' - 'a' is 0 and 'b' - 'a' is 1.
    // There is a casting from char to int for the - operation.
    for (ps=0; ps<=lstp1; ps++) {
      state = transitionFunction [state] [stp.charAt(ps) - 'a'];
    }

    System.out.print("\nThe input string\n " + stp + "\nis ");
    if (!isFinal(state)) { System.out.print("NOT "); }
    System.out.print("accepted by the given finite automaton.\n");
  }
}

/**
 * =======================================================================
 * The transition function of our finite automaton is:
 *
 *               0     1
 *            | 'a' | 'b'
 *    --------|-----|-----
 *    state 0 |  1  |  3
 *    --------|-----|-----
 *    state 1 |  2  |  3
 *    --------|-----|-----
 *    state 2 |  2  |  1
 *    --------|-----|-----
 *    state 3 |  3  |  3
 *    --------|-----|-----
 *
 * The initial state is 0. The final states are 1 and 2.
 * 'a' is at column 0 and 'b' is at column 1.
 * -----------------------------------------------------------------------
 * input:
 * ------
 * javac FiniteAutomatonParser.java
 * java FiniteAutomatonParser
 *
 * output:
 * -------
 * The input string
 *  aaba
 * is accepted by the given finite automaton.
 * -----------------------------------------------------------------------
 * For stp = "baaba" we get:
 *
 * The input string
 *  baaba
 * is NOT accepted by the given finite automaton.
 * =======================================================================
 */

Now we present a different technique for constructing a parser for the language
L(G) for any given right linear grammar G. It is assumed that ε ∉ L(G).

We will see that technique in action in the following example. Let us consider
the right linear grammar G with the following four productions:

1. P → a
2. Q → b
3. P → a Q
4. Q → b P

The number k (≥ 1) to the left of each production is the so-called sequence order of
the production. These productions can be represented as a string which is the
concatenation of the substrings, each of which represents a single production according
to the following convention:
(i) every production of the form A → a B is represented by the substring A a B,
and
(ii) every production of the form A → a is represented by the substring A a ., where
‘.’ is a special character not in V_T ∪ V_N.

Thus, the above four productions can be represented by the following string gg,
short for ‘given grammar’:

production                       =  P → a | Q → b | P → a Q | Q → b P
gg                               =  P a . | Q b . | P a Q   | Q  b  P
position of the character        =  0 1 2 | 3 4 5 | 6 7 8   | 9 10 11
sequence order of the production =  1     | 2     | 3       | 4


In the above lines the vertical bars have no significance: they have been drawn only
for making it easier to visualize the productions. Underneath the string gg, viewed as
an array of characters, we have indicated: (i) the position of each of its characters
(that position is the index of the array where the character occurs), and (ii) the
sequence order of each production. For instance, in the string gg the character Q
occurs at positions 3, 8, and 9, and the sequence order of the production Q → b P
is 4. By writing gg[i] = A we will express the fact that the character A occurs at
position i in the string gg.


NOTATION 2.10.1. [Identification of Productions] We assume that every
production represented in the string gg is identified by the position p of the nonterminal
symbol of its left hand side, or by its sequence order s. We have that: s = (p/3)+1.
Thus, for instance, for the grammar G above, we have that the production Q → b P
is identified by the position 9 and also by the sequence order 4. □
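The representation of the productions as a string can be explored with a few lines of Java. The following sketch lists, for every production encoded in gg, its position, its sequence order, and its usual written form. The class name GrammarString and the method names are ours, and the arrow is printed as -> for simplicity.

```java
class GrammarString {

    // The string gg of the example: P -> a, Q -> b, P -> a Q, Q -> b P.
    static final String GG = "Pa.Qb.PaQQbP";

    // Sequence order s of the production identified by the position p.
    static int sequenceOrder(int p) {
        return p / 3 + 1;
    }

    // Written form of the production identified by the position p.
    static String production(String gg, int p) {
        String rhs = "" + gg.charAt(p + 1);
        if (gg.charAt(p + 2) != '.') {
            rhs = rhs + " " + gg.charAt(p + 2);
        }
        return gg.charAt(p) + " -> " + rhs;
    }

    public static void main(String[] args) {
        for (int p = 0; p < GG.length(); p += 3) {
            System.out.println("position " + p + ", sequence order "
                + sequenceOrder(p) + ": " + production(GG, p));
        }
    }
}
```

The division p / 3 is the integer division, as in the formula s = (p/3)+1, because the positions which identify productions are multiples of 3.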


We also assume that gg[0], that is, the leftmost character in gg, is the axiom of 
the grammar G. 

Recalling the results of Section 2.2 starting on page 33, we have that a right
linear grammar G corresponds to a finite automaton, call it M, which, in general, is a
nondeterministic automaton (recall Algorithm 2.2.2 on page 34: from an S-extended
type 3 grammar that algorithm constructs a nondeterministic finite automaton). We
also have that the nonterminal symbols occurring at the positions 0, 3, 6, ... and
2, 5, 8, ... of the string gg, that is, the nonterminal symbols of the grammar G, can
be viewed as the names of the states of the nondeterministic finite automaton M
corresponding to G. The symbol ‘.’ can be viewed as the name of a final state of
the automaton M, as we will explain below.

When a string stp is given in input, one character at a time, to the automaton
M for checking whether or not stp ∈ L(G), we have that M makes a move from
the current state to a new state, for each new character which is given in input. M
accepts the string stp iff the move it makes for the rightmost character of stp, takes
M to a final state. We have that:

(i) if we apply the production A → a B, then the automaton M reads the input
character a and makes a transition from state A to state B, and

(ii) if we apply the production A → a, the automaton M reads the input character a
and makes a transition from state A to a final state, if any.


The leftmost character of the string gg is gg[0] and it is P in our case. We denote
the length of gg by lgg. In our case lgg is 12. Thus, the rightmost character P of gg
is gg[lgg−1]. In the program below (see Section 2.10.1 starting on page 82), we have
used the identifier lgg1 to denote lgg−1. The pointer to a character of the string gg
is called pg, short for ‘pointer to grammar’. Thus, gg = gg[0] ... gg[lgg−1], and for
pg = 0, ..., lgg−1, the character of gg occurring at position pg is gg[pg].

The leftmost character of the string stp to parse is stp[0]. We denote the length
of stp by lstp. Thus, the rightmost character of stp is stp[lstp−1]. In the program
below (see Section 2.10.1) we have used the identifier lstp1 to denote lstp−1. The
pointer to a character of the string stp is called ps, short for ‘pointer to string’.
Thus, stp = stp[0] ... stp[lstp−1], and for ps = 0, ..., lstp−1, the character of stp
occurring at position ps is stp[ps].


In our example the finite automaton M obtained from the grammar G by
applying Algorithm 2.2.2 on page 34, is nondeterministic. Indeed, for instance, for
the input character a, M makes a transition from state P either to state Q, if the
production P → a Q is applied, or to a final state, if the production P → a is
applied.

In order to check whether or not stp belongs to L(G), we may use the automa- 
ton M. The nondeterminism of M can be taken into account by a backtracking 
algorithm which explores all possible derivations from the axiom of G, and each 
derivation corresponds to a sequence of moves of M. 

The backtracking algorithm is implemented via a parsing function, called parse, 
whose definition will be given below. The function parse has the following three 
arguments: 


(i) al (short for ancestor list), which is the list of the productions which are ancestors
of the current production, that is, the list of the positions of the nonterminal symbols
which are the left hand sides of the productions which have been applied so far for
parsing the prefix stp[0] ... stp[ps−1] of string stp,


(ii) pg, which is the current production, that is, the position of the nonterminal
symbol which is the left hand side of the current production (that production is
represented by the substring: gg[pg] gg[pg+1] gg[pg+2]), and


(iii) ps, which is the position of the current character of the string stp, that is, the
current character to be parsed is stp[ps] (and we must have that stp[ps] = gg[pg+1]
for a successful parsing of that character).


The initial call to the function parse is: parse([],0,0). In this function call
we have that: (i) the first argument [] is the empty list of ancestor productions,
(ii) the second argument 0 is the position of the left hand side gg[0] of the leftmost
production of the axiom of the given grammar G (that is, 0 is the position of the
axiom gg[0] of the grammar G), and (iii) the third argument 0 is the position of the
leftmost character stp[0] of the input string stp.

Now, in order to explain how the function parse works, let us consider the
following situation during the parsing process.

Suppose that we have parsed the prefix ab of the string stp = abab and the
current character to be parsed is stp[2], which is the second character a from the
left. Thus, ps = 2 (see Figure 2.10.2 on page 81). Let us assume that in order
to parse the prefix ab, we have first applied the production P → a Q and then the
production Q → b P. Thus, the ancestor list is [6 9] with head 9. Indeed, (i) the first
production P → a Q is identified by the position 6, and (ii) the second production
Q → b P is identified by the position 9.


REMARK 2.10.2. Contrary to what is usually assumed, we consider that the
ancestor list grows ‘to the right’, and its head is its rightmost element. □


Since the right hand side of the last production we have applied is b P, the current
production is the leftmost production of the grammar G whose left hand side is P.
This production is P → a (which is represented by the string P a .) and its left
hand side P is at position 0 in the string gg. Thus, pg = 0.

Figure 2.10.2 depicts the parsing situation which we have described. The values
of the variables gg, stp, al, pg, and ps are as follows.


gg  = P a . Q b . P a Q Q b  P    : the productions of the given grammar.
      0 1 2 3 4 5 6 7 8 9 10 11   : the positions of the characters.
      1     2     3     4         : the sequence order of the productions.
stp = abab                        : the string to parse.
al  = [6 9]                       : the ancestor list with head 9.
pg  = 0                           : the current production is P → a
                                    and its left hand side P is gg[pg].
ps  = 2                           : the current character a is stp[ps].


Before giving the definition of the function parse, we will give the definition of two 
auxiliary functions: 


(i) findLeftmostProd(pg) and
(ii) findNextProd(pg).


The function findLeftmostProd(pg) returns the position, if any, which identifies in
the string gg the leftmost production for the nonterminal occurring in gg[pg]. If there
is no such production, then findLeftmostProd(pg) returns a number which does not
denote any position. In the program below we have chosen to return the number
-1 (see Section 2.10.1). For instance, if (i) gg[pg] = P, (ii) the leftmost production
for P in gg is P a ., and (iii) the symbol P of P a . occurs at position 0 in gg, then
findLeftmostProd(pg) returns 0.


The function findNextProd(pg) returns the smallest position greater than pg+2,
if any, which identifies in the string gg the next production whose left hand side
is the nonterminal in gg[pg]. If there is no such production, then findNextProd(pg)
returns a number which does not denote any position. In the program below we
have chosen to return the number -1 (see Section 2.10.1).


The ancestor list is [6,9]. It grows ‘to the right’ and its head is 9.
6 identifies the production P → a Q (indeed, gg[6] = P) and
9 identifies the production Q → b P (indeed, gg[9] = Q).

ancestor list :                     current production :
6                9                  0
3. P → a Q       4. Q → b P         1. P → a

0 identifies the production P → a (indeed, gg[0] = P).


FIGURE 2.10.2. Parsing the string stp = abab, given the grammar
with the four productions: 1. P → a, 2. Q → b, 3. P → a Q, and
4. Q → b P. We have parsed the prefix a b, and the current character
is the second a from the left, that is, stp[2].
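The two auxiliary functions can be sketched in Java as follows. This is only a sketch which follows the above description: it is not the program of Section 2.10.1, the string gg is passed as an extra argument for making the sketch self-contained, and the class name ProductionSearch is ours.

```java
class ProductionSearch {

    // The string gg of the example: P -> a, Q -> b, P -> a Q, Q -> b P.
    static final String GG = "Pa.Qb.PaQQbP";

    // Position of the leftmost production whose left hand side is the
    // symbol gg[pg], or -1 if there is no such production
    // (in particular, when gg[pg] is the special character '.').
    static int findLeftmostProd(String gg, int pg) {
        char symbol = gg.charAt(pg);
        for (int p = 0; p < gg.length(); p += 3) {
            if (gg.charAt(p) == symbol) return p;
        }
        return -1;
    }

    // Smallest position greater than pg+2 of a production whose left
    // hand side is the nonterminal gg[pg], or -1 if there is none.
    // Here pg is assumed to identify a production (a multiple of 3).
    static int findNextProd(String gg, int pg) {
        char nonterminal = gg.charAt(pg);
        for (int p = pg + 3; p < gg.length(); p += 3) {
            if (gg.charAt(p) == nonterminal) return p;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(findLeftmostProd(GG, 11));  // gg[11] is P
        System.out.println(findNextProd(GG, 0));       // next production for P
    }
}
```

Both functions scan only the positions 0, 3, 6, ... of gg, because these are the positions which identify productions.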


Here is the tail recursive definition of the function parse. Note that in this 
definition the order of the if-then constructs is significant, and when the condition 
of an if-then construct is tested, we can rely on the fact that the conditions in all 
previous if-then constructs are false. 


parse(al, pg, ps) =
  if al = [] ∧ pg = -1 then false
  else if pg = -1 then parse(tail(al), findNextProd(head(al)), ps−1)        (A)
  else if (gg[pg+1] ≠ stp[ps]) ∨
          (gg[pg+2] = ‘.’ ∧ ps ≠ lstp1) ∨
          (gg[pg+2] ≠ ‘.’ ∧ ps = lstp1)
       then parse(al, findNextProd(pg), ps)                                  (B)
  else if (gg[pg+2] = ‘.’ ∧ ps = lstp1) then true
  else parse(cons(pg, al), findLeftmostProd(pg+2), ps+1)                     (C)


where tail, head, and cons are the usual functions on lists. For instance, given
the list l = [5 7 2] with head 2, we have that: tail(l) = [5 7], head(l) = 2, and
cons(2, [5 7]) = [5 7 2].


In Case (A) we have that al ≠ [] and pg = -1. In this case we look for an
alternative production of the nonterminal symbol which occurs in the string gg in
the position indicated by the head of the ancestor list (see in the Java program of
Section 2.10.1 Case (A) named: "Alternative production from the father").

In Case (B) we look for an alternative production for the nonterminal of the
left hand side of the current production (see in the Java program of Section 2.10.1
Case (B) named "Alternative production").


In Case (C) we have that gg[pg+2] ≠ ‘.’ and ps < lstp1. In this case the
current production is able to generate the terminal symbol in stp[ps] and thus, we
‘go down the string’ by doing the following actions (see in the Java program of
Section 2.10.1 Case (C) named: "Go down the string"):


(i) the position in gg of the left hand side of the current production is added to the 
ancestor list and becomes its new head, 


(ii) the new current production becomes the leftmost production, if any, of the 
nonterminal symbol, if any, at the rightmost position on the right hand side of the 
previous current production, and 


(iii) the new current character becomes the character which is one position to the 
right of the previous current character in the string stp. 
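Since the definition of parse is tail recursive, it can be rendered as a loop. The following Java sketch does so, under the following assumptions: the ancestor list is kept in a stack, the string gg is passed as an argument, and the empty string is rejected at once because ε ∉ L(G). This sketch is independent of the program of Section 2.10.1, and the class name BacktrackingParser is ours.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class BacktrackingParser {

    static int findLeftmostProd(String gg, int pg) {
        for (int p = 0; p < gg.length(); p += 3) {
            if (gg.charAt(p) == gg.charAt(pg)) return p;
        }
        return -1;
    }

    static int findNextProd(String gg, int pg) {
        for (int p = pg + 3; p < gg.length(); p += 3) {
            if (gg.charAt(p) == gg.charAt(pg)) return p;
        }
        return -1;
    }

    // Iterative version of parse([],0,0) for the grammar gg and the string stp.
    static boolean parse(String gg, String stp) {
        if (stp.isEmpty()) return false;       // epsilon is not in L(G)
        int lstp1 = stp.length() - 1;
        Deque<Integer> al = new ArrayDeque<>(); // ancestor list, head on top
        int pg = 0;
        int ps = 0;
        while (true) {
            if (pg == -1) {
                if (al.isEmpty()) return false;
                pg = findNextProd(gg, al.pop());           // Case (A)
                ps = ps - 1;
            } else {
                boolean endsWithDot = gg.charAt(pg + 2) == '.';
                boolean lastChar = ps == lstp1;
                if (gg.charAt(pg + 1) != stp.charAt(ps) || endsWithDot != lastChar) {
                    pg = findNextProd(gg, pg);             // Case (B)
                } else if (endsWithDot) {
                    return true;                           // '.' and last character
                } else {
                    al.push(pg);                           // Case (C)
                    pg = findLeftmostProd(gg, pg + 2);
                    ps = ps + 1;
                }
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("Pa.Qb.PaQQbP", "abab"));
        System.out.println(parse("Pa.Qb.PaQQbP", "aa"));
    }
}
```

The condition of Case (B) is written in a compact form: when gg[pg+1] = stp[ps], a new production has to be looked for iff exactly one of the two conditions gg[pg+2] = ‘.’ and ps = lstp1 holds.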


2.10.1. A Java Program for Parsing Regular Languages. 


In this section (see pages 84-89) we present a program written in Java, that im- 
plements the parsing algorithm which we have described at the beginning of Sec- 
tion 2.10. This program has been successfully compiled and executed using the 
Java 2 Standard Edition (J2SE) Software Development Kit (SDK), Version 1.4.2, 
running under Linux Mandrake 9.0 on a Pentium III machine. The Java 2 Standard 
Edition Software Development Kit can be found at http://java.sun.com/j2se/. 


In our Java program we will print an element n, with 0 ≤ n ≤ lgg−1, of the
ancestor list as the pair k.P, where: (i) k is the sequence order (see page 77) of the
production identified by the position n in the string gg, and (ii) P is the
nonterminal symbol of the left hand side of that production.
Since in the string gg every production is represented by a substring of three
characters, we have that the sequence order of the production identified by n (that
is, whose left hand side occurs in the string gg at position n) is (n/3) + 1 (see the
method pPrint(int i) in our Java program below).


We will print the current production identified by the position pg, as the string:
gg[pg] → gg[pg+1] gg[pg+2], and we will not print gg[pg+2] if it is equal to ‘.’. The
current character in the string stp is the one at position ps. Thus, it is stp[ps].


In the comments at the end of our Java program below (see page 89), we will 
show a trace of the program execution when parsing the string stp = aba, given 
the right linear grammar (different from the one of Figure 2.10.2) with the following 
productions: 


1. $P \rightarrow a$
2. $P \rightarrow aP$
3. $Q \rightarrow bP$
4. $P \rightarrow aQ$
[Figure 2.10.3: tree of the search space. The upper part of the figure shows the
ancestor list, made out of positions 9 and 6 of gg (productions 4. $P \rightarrow aQ$ and
3. $Q \rightarrow bP$), and the current production at position 0 (production 1. $P \rightarrow a$).]


FIGURE 2.10.3. The search space when parsing the string stp = aba,
given the regular grammar whose four productions are: 1. $P \rightarrow a$,
2. $P \rightarrow aP$, 3. $Q \rightarrow bP$, 4. $P \rightarrow aQ$. The sequence order of the
productions corresponds to the top-down order of the nodes with the
same father node. During backtracking the algorithm generates and
explores node (0) through node (8) in ascending order. Nodes (9) and
(10) are not generated.


In Figure 2.10.3 we have shown the search space explored by our Java program when
parsing the string aba. Using the backtracking technique, the program starts from
node (0), which is the axiom of our grammar, and then it generates and explores
node (1) through node (8) in ascending order. In the upper part of that figure we
have also shown the ancestor list and the last current production when parsing is
finished.


/**
 * ========================================================================
 * The input grammar is given as a string, named 'gg', of the form:
 *
 *    terminal    ::= 'a'..'z'
 *    nonterminal ::= 'A'..'Z'
 *    rsides      ::= terminal '.' | terminal nonterminal
 *    symprod     ::= nonterminal rsides
 *    grammar     ::= symprod | symprod grammar
 *
 * Note that epsilon productions are not allowed.
 * Note also that the definition of rsides uses a right linear production.
 * When writing the string gg which encodes the productions of the given
 * input grammar, the productions relative to the same nonterminal need
 * not be grouped together. For instance, gg = "Pa.PaPQbPPaQ" encodes the
 * following four productions:
 *
 *    1. P->a    2. P->aP    3. Q->bP    4. P->aQ
 *
 * where the number k (>=1) to the left of each production is the 'sequence
 * order' of that production. The productions are ordered from left to
 * right according to their occurrence in the string gg.
 * The function 'length' gives the length of a string.
 * The string to be parsed is the string 'stp'. Each character in stp
 * belongs to the set {'a',...,'z'}.
 * Note that a left linear grammar of the form:
 *    rsides ::= terminal '.' | nonterminal terminal
 * can always be transformed into a right linear grammar of the form:
 *    rsides ::= terminal '.' | terminal nonterminal
 * that is, left-recursion can be avoided in favour of right-recursion.
 * (see: Exercise 3.7 in Hopcroft-Ullman: Formal Languages and Their
 * Relation to Automata. Addison Wesley. 1969)
 * A production in the string gg is identified by the position pg where
 * its left hand side occurs. For instance, if gg = "Pa.PaPQbPPaQ",
 * the production P->aQ is identified by pg == 9. The sequence order k of
 * a production identified by pg is (pg/3)+1.
 * It should be the case that: 0 <= pg <= length(gg)-1.
 * If we have that pg == -1 then this means that:
 *  (i) either there is no production for '.'
 *  (ii) or the leftmost production or next production to be found for
 * a given nonterminal does not exist (this is the case when no production
 * exists for the given nonterminal or all productions for the given
 * nonterminal have already been considered).
 * The ancestorList stores the productions which have been used so far for
 * parsing. Each element n of the ancestorList is printed as a pair 'k. P'
 * where k is the sequence order of the production identified by n,
 * that is, (n/3)+1, and P is the production whose sequence order is k.
 * The ancestorList is printed with its head 'to the right'.
 * There is a global variable named 'traceon'. If it is set to 'true' then
 * we trace the execution of the method parse(al,pg,ps).
 * ========================================================================
 */
import java.util.ArrayList;
import java.util.Iterator;

class List {
  /** ---------------------------------------------------------------------
   * The class 'List' is the type of a list of integers. The functions:
   * cons(int i), head(), tail(), isSingleton(), isNull(), copy(), and
   * printList() are available.
   * ----------------------------------------------------------------------
   */
  public ArrayList<Integer> list;

  public List() {
    list = new ArrayList<Integer>();
  }

  // The head of the list is the last element of the underlying ArrayList.
  public void cons(int datum) {
    list.add(Integer.valueOf(datum));
  }

  public int head() {
    if (list.isEmpty())
      {System.out.println("Error: head of empty list!");};
    return (list.get(list.size() - 1)).intValue();
  }

  public void tail() {
    if (list.isEmpty())
      {System.out.println("Error: tail of empty list!");};
    list.remove(list.size() - 1);
  }

  public boolean isSingleton() {
    return list.size() == 1;
  }

  public boolean isNull() {
    return list.isEmpty();
  }

  /*
  // ------------------------ Copying a list using clone() --------------
  public List copy() {
    List copyList = new List();
    copyList.list = (ArrayList<Integer>)list.clone();
    // above: unchecked casting from Object to ArrayList<Integer>
    return copyList;
  }
  */
  // ------------------------ Copying a list without using clone() ------
  public List copy() {
    List copyList = new List();
    for (Iterator<Integer> iter = list.iterator(); iter.hasNext(); ) {
      Integer k = iter.next();
      copyList.list.add(k);
    }
    return copyList;
  }

  public void printList() {     // overloaded method: arity 0
    System.out.print("[ ");
    for (Iterator<Integer> iter = list.iterator(); iter.hasNext(); ) {
      System.out.print((iter.next()).toString() + " ");
    }
    System.out.print("]");
  }
}
public class RegParserJava {

  static String gg, stp;    // gg: given grammar, stp: string to parse.
  static int lgg1, lstp1;
  static boolean traceon;

  /** ---------------------------------------------------------------------
   * The global variables are: gg, stp, lgg1, lstp1, traceon.
   * lgg1 is the length of the given grammar gg minus 1.
   * lstp1 is the length of the given string to parse stp minus 1.
   * The 'minus 1' is due to the fact that indexing in Java (as in C++)
   * begins from 0.
   * Thus, for instance,
   *     stp=abcab     length(stp)=5     stp.charAt(2) is c.
   *   index: 01234    lstp1=4
   * ----------------------------------------------------------------------
   */
  // ------------------------------------------------------------------------
  // printing a given production
  private static void pPrint(int i) {
    System.out.print(i/3+1 + ". " + gg.charAt(i) + "->" + gg.charAt(i+1));
    if (gg.charAt(i+2) != '.') {System.out.print(gg.charAt(i+2));}
  }
  // ------------------------------------------------------------------------
  // printing a given grammar
  private static void gPrint() {
    int i=0;
    while (i<=lgg1) {pPrint(i); System.out.print(" "); i=i+3;};
    System.out.print("\n");
  }
  // ------------------------------------------------------------------------
  // printing the ancestorList: the head is 'to the right'
  private static void printNeList(List l) {
    List l1 = l.copy();
    if (l1.isSingleton())
      {pPrint(l1.head());}
    else
      {l1.tail(); printNeList(l1);
       System.out.print(", "); pPrint(l.head());};
  }

  private static void printList(List l) {   // overloaded method: arity 1
    if (l.isNull()) {System.out.print("[]");}
    else {System.out.print("["); printNeList(l); System.out.print("]");};
  }
  // ------------------------------------------------------------------------
  // tracing
  private static void trace(String s, List al, int pg, int ps) {
    if (traceon)
      {System.out.print("\n\nancestorsList: ");
       printList(al);  // Printing ancestorsList using printList of arity 1.
       System.out.print("\nCurrent production: ");
       if (pg == -1) { System.out.print("none"); } else { pPrint(pg); };
       System.out.print("\nCurrent character: " + stp.charAt(ps));
       System.out.print(" => " + s);
      }
  }
  // ------------------------------------------------------------------------
  // next production
  private static int findNextProd(int p) {
    char s = gg.charAt(p);
    do {p = p+3;} while (!((p>(lgg1)) || (gg.charAt(p) == s)));
    if (p <= lgg1) { return p; } else { return -1; }
  }
  // ------------------------------------------------------------------------
  // leftmost production
  private static int findLeftmostProd(int p) {
    char s = gg.charAt(p);
    int i=0;
    while ((i<=lgg1) && (gg.charAt(i) != s)) { i = i+3; };
    if (i <= lgg1) { return i; } else { return -1; }
  }
  // ------------------------------------------------------------------------
  // parsing
  private static boolean parse(List al, int pg, int ps) {
    if ((al.isNull()) && (pg == -1))
      {trace("Fail.",al,pg,ps);
       return false;
      }
    else if (pg == -1)                                    // Case (A) ----
      {trace("Alternative production from the father.",al,pg,ps);
       int h = al.head();     // al.head() is computed before al.tail()
       al.tail();
       ps--;
       return parse(al,findNextProd(h),ps);
      }
    else if ((gg.charAt(pg+1) != stp.charAt(ps)) ||       // Case (B) ----
             ((gg.charAt(pg+2)=='.') && (ps != lstp1)) ||
             ((gg.charAt(pg+2)!='.') && (ps == lstp1)) )
      {trace("Alternative production.",al,pg,ps);
       return parse(al,findNextProd(pg),ps);
      }
    else if ((gg.charAt(pg+2) == '.') && (ps == lstp1))
      {trace("Success.\n",al,pg,ps);
       return true;
      }
    else {trace("Go down the string.",al,pg,ps);          // Case (C) ----
          al.cons(pg);
          ps++;
          return parse(al,findLeftmostProd(pg+2),ps);
         }
  }
  // ------------------------------------------------------------------------
  public static void main(String[] args) {
    traceon = true;

    gg = "Pa.PaPQbPPaQ";     // example 0
    stp = "aba";             // true

    lgg1 = gg.length() - 1;
    lstp1 = stp.length() - 1;
    System.out.print("\nThe given grammar is: ");
    gPrint();

    char axiom = gg.charAt(0);
    System.out.print("The axiom is " + axiom + ".");
    List al = new List();
    int pg = 0;
    int ps = 0;

    boolean ans = parse(al,pg,ps);

    System.out.print("\nThe input string\n "+ stp + "\nis ");
    if (!ans) {System.out.print("NOT ");};
    System.out.print("generated by the given grammar.\n");
  }
}

/**
 * ========================================================================
 * In our system the Java compiler for Java 1.5 is called 'javac'.
 * Analogously, the Java runtime system for Java 1.5 is called 'java'.
 *
 * Other examples:
 *
 *   gg = "Pa.PbQPbP";           // example 1
 *   stp = "ba";                 // true
 *
 *   gg="PbPPa.";                // example 2
 *   stp="aba";                  // false
 *   stp="bba";                  // true
 *
 *   gg="Pa.PaQQb.QbP";          // example 3
 *   stp="ab";                   // true
 *   stp="ababa";                // true
 *   stp="aaba";                 // false
 *
 *   gg="Pa.Qb.PbQQaQ";          // example 4
 *   stp="baaab";                // true
 *   stp="baab";                 // true
 *   stp="bbaaba";               // false
 *
 *   gg="Pa.Qb.PaQQbPPaP";       // example 5. Note: PaQ and PaP
 *   stp="aabaaa";               // true
 *   stp="aabb";                 // false
 *
 *   javac RegParserJava.java
 *   java RegParserJava
 *
 * ------------------------------------------------------------------------
 * output: traceon == true.
 *
 * The given grammar is: 1. P->a 2. P->aP 3. Q->bP 4. P->aQ
 * The axiom is P.
 *
 * ancestorsList: []
 * Current production: 1. P->a
 * Current character: a => Alternative production.
 *
 * ancestorsList: []
 * Current production: 2. P->aP
 * Current character: a => Go down the string.
 *
 * ancestorsList: [2. P->aP]
 * Current production: 1. P->a
 * Current character: b => Alternative production.
 *
 * ancestorsList: [2. P->aP]
 * Current production: 2. P->aP
 * Current character: b => Alternative production.
 *
 * ancestorsList: [2. P->aP]
 * Current production: 4. P->aQ
 * Current character: b => Alternative production.
 *
 * ancestorsList: [2. P->aP]
 * Current production: none
 * Current character: b => Alternative production from the father.
 *
 * ancestorsList: []
 * Current production: 4. P->aQ
 * Current character: a => Go down the string.
 *
 * ancestorsList: [4. P->aQ]
 * Current production: 3. Q->bP
 * Current character: b => Go down the string.
 *
 * ancestorsList: [4. P->aQ, 3. Q->bP]
 * Current production: 1. P->a
 * Current character: a => Success.
 *
 * The input string
 *  aba
 * is generated by the given grammar.
 * ========================================================================
 */
2.11. Generalizations of Finite Automata


According to its definition, a deterministic finite automaton can be viewed as having
a read-only input tape without endmarkers, whose head, called the input head, moves
to the right only. No transition of states is made without reading an input symbol
and in that sense we say that a finite automaton is not allowed to make $\varepsilon$-moves
(recall Remark 2.1.2 on page 30). According to Definition 2.1.4 on page 30, we have
that a finite automaton accepts an input string if it makes a transition to a final state
when the input head has read the rightmost symbol of the input string. Initially,
the input head is on the leftmost cell of the input tape (see Figure 2.11.1).


[Figure 2.11.1: a read-only input tape without endmarkers, whose input head moves
from left to right, connected to a finite control.]


FIGURE 2.11.1. A one-way deterministic finite automaton with a
read-only input tape and without endmarkers.


A finite automaton can be generalized by assuming that it has a read-only input
tape without endmarkers and its head may move to the left and to the right. This
generalization is called a two-way deterministic finite automaton. We assume that
a two-way deterministic finite automaton accepts an input string iff it makes a
transition to a final state while the input head has read the rightmost input symbol.

We also assume that a move which (i) reads the rightmost input character $i_n$ and
makes the input head go to the right, or (ii) reads the leftmost input character $i_1$
and makes the input head go to the left, can be made but it is a final move, that is,
no more transitions of states can be made. After any such move, the finite automaton
stops in the state where it is after that move.

One can show that two-way deterministic finite automata accept exactly the
regular languages [6].

If we allow any of the following generalizations (in any possible combination)
then the class of accepted languages remains that of the regular languages:
(i) at each move the input head may move left or right or remain stationary (this last
case corresponds to an $\varepsilon$-move, that is, a state transition when no input character
is read),
(ii) the automaton in the finite control is nondeterministic, and
(iii) the input tape has a left endmarker ¢ and a right endmarker \$ (see Figure 2.11.2)
which are assumed not to be symbols of the input alphabet $\Sigma$. In this last
generalization we assume that the input head initially scans the left endmarker ¢ and the
acceptance of a word is defined by the fact that the automaton makes a transition
to a final state while the input head reads any cell of the input tape.


[Figure 2.11.2: on the left, a read-only input tape without endmarkers; on the right,
a read-only input tape with endmarkers. In both pictures the input head moves from
left to right and vice versa, and the tape is connected to a finite control.]


FIGURE 2.11.2. A two-way deterministic finite automaton with a
read-only input tape, without and with endmarkers (see the left and
the right pictures, respectively).


A different generalization of the basic definition of a finite automaton is obtained
by allowing the production of some output. The production of some output can
be viewed as a generalization of either (i) the notion of a state, and we will have
the so-called Moore Machines (see Section 2.11.1), or (ii) the notion of a transition
and we will have the so-called Mealy Machines (see Section 2.11.2). Acceptance is
by entering a final state while the input head moves to the right of the rightmost
input symbol. As for the basic notion of a finite automaton, $\varepsilon$-moves are not allowed
(recall Remark 2.1.2 on page 30).


2.11.1. Moore Machines. 


A Moore Machine is a finite automaton in which together with the transition
function $\delta$, we also have an output function $\lambda$ from the set of states $Q$ to the so-called
output set $\Omega$, which is a given set of symbols. No $\varepsilon$-moves are allowed, that is,
the transition function $\delta$ is a function from $Q \times \Sigma$ to $Q$ and a new symbol of the
input string should be read each time the function $\delta$ is applied. Thus, we associate
an element of $\Omega$ with each state in $Q$, and we associate an element of $\Omega^+$ with a
(possibly empty) sequence of state transitions.

A Moore Machine with initial state $q_0$ associates the string $\lambda(q_0)$ with the empty
sequence of state transitions.


2.11.2. Mealy Machines. 


A Mealy Machine is a finite automaton in which together with the transition
function $\delta$, we have an output function $\mu$ from the set $Q \times \Sigma$, where $Q$ is a finite set of
states and $\Sigma$ is the set of input symbols, to the output set $\Omega$, which is a given set
of symbols. No $\varepsilon$-moves are allowed, that is, the transition function $\delta$ is a function
from $Q \times \Sigma$ to $Q$ and a new symbol of the input string should be read each time the
function $\delta$ is applied. Thus, we associate an element of $\Omega^*$ with a (possibly empty)
sequence of state transitions.

A Mealy Machine associates the empty string $\varepsilon$ with the empty sequence of state
transitions. If we forget about the output produced by the Moore Machine for the
empty input string then: (i) for each Moore Machine there exists an equivalent
Mealy Machine, that is, a Mealy Machine which accepts the same set of input words
and produces the same set of output words, and (ii) for each Mealy Machine there
exists an equivalent Moore Machine.
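
The relation between the two kinds of machines can be illustrated by a small
experiment. The following is a minimal sketch under our own assumptions: the
machine (two states which record the parity of the number of 1's read so far), the
output symbols 'E' and 'O', and all the names in the code are hypothetical and are
not taken from the text. The Mealy output function is obtained from the Moore
output function by taking mu(q,a) = lambda(delta(q,a)).

```java
// Hypothetical example: a Moore Machine and a Mealy Machine derived from it.
class MooreMealySketch {

    // Transition function delta on the states {0,1} and the input symbols '0','1':
    // the state is the parity of the number of 1's read so far.
    static int delta(int q, char a) {
        return (a == '1') ? 1 - q : q;
    }

    // Moore output function lambda from Q to Omega = {'E','O'}.
    static char lambda(int q) {
        return (q == 0) ? 'E' : 'O';
    }

    // Mealy output function mu from Q x Sigma to Omega, derived from lambda.
    static char mu(int q, char a) {
        return lambda(delta(q, a));
    }

    static String runMoore(String input) {
        int q = 0;                                 // initial state
        StringBuilder out = new StringBuilder();
        out.append(lambda(q));                     // output for the empty sequence of transitions
        for (char a : input.toCharArray()) {
            q = delta(q, a);
            out.append(lambda(q));
        }
        return out.toString();
    }

    static String runMealy(String input) {
        int q = 0;                                 // initial state
        StringBuilder out = new StringBuilder();   // empty output for the empty sequence
        for (char a : input.toCharArray()) {
            out.append(mu(q, a));
            q = delta(q, a);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(runMoore("101"));
        System.out.println(runMealy("101"));
    }
}
```

In this sketch the output of the Moore Machine is always one symbol longer than
that of the Mealy Machine, and the two outputs coincide once the first symbol,
that is, lambda(q0), is dropped.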


2.11.3. Generalized Sequential Machines. 


Mealy Machines can be generalized to Generalized Sequential Machines which we 
now introduce. These machines will allow us to introduce the notion of (deterministic 
and nondeterministic) translation of words between two given alphabets. 


DEFINITION 2.11.1. [Generalized Sequential Machine and $\varepsilon$-free Generalized
Sequential Machine] A Generalized Sequential Machine (GSM, for short)
is a 6-tuple of the form: $(Q, \Sigma, \Omega, \delta, q_0, F)$, where $Q$ is a finite set of states, $\Sigma$ is the
input alphabet, $\Omega$ is the output alphabet, $\delta$ is a partial function from $Q \times \Sigma$ to the set
of the finite subsets of $Q \times \Omega^*$, called the transition function, $q_0$ in $Q$ is the initial
state, and $F \subseteq Q$ is the set of final states.

A GSM is said to be $\varepsilon$-free iff its transition function $\delta$ is a partial function from
$Q \times \Sigma$ to the set of the finite subsets of $Q \times \Omega^+$, that is, when an $\varepsilon$-free GSM makes
a state transition, it never produces the empty word $\varepsilon$.


Note that by definition a generalized sequential machine is a nondeterministic 
machine. 

The interpretation of the transition function of a generalized sequential machine
is as follows: if the generalized sequential machine is in the state $p$ and reads the
input symbol $a$, and $(q,w)$ belongs to $\delta(p,a)$ then the machine makes a transition
to the state $q$, and produces the output string $w \in \Omega^*$.

As in the case of a finite automaton, a GSM can be viewed as having a read-only
input tape without endmarkers whose head moves to the right only. Acceptance is
by entering a final state, while the input head moves to the right of the rightmost
cell containing the input. Initially, the input head is on the leftmost cell of the input
tape. No $\varepsilon$-moves on the input are allowed, that is, a new symbol of the input string
should be read each time a move is made.


Generalized Sequential Machines are useful for studying the closure properties
of various classes of languages and, in particular, the closure properties of regular
languages. They may also be used for formalizing the notion of a nondeterministic
translation of words from $\Sigma^*$ to $\Omega^*$ [9]. The translation is obtained as follows.

First, we extend the partial function $\delta$ whose domain is $Q \times \Sigma$, to a partial
function, denoted by $\delta^*$, whose domain is $Q \times \Sigma^*$, as follows:


(i) for any $p \in Q$,
     $\delta^*(p, \varepsilon) = \{(p, \varepsilon)\}$
(ii) for any $p \in Q$, $x \in \Sigma^*$, and $a \in \Sigma$,
     $\delta^*(p, xa) = \{(q, w_1 w_2) \mid (p_1, w_1) \in \delta^*(p, x)$ and $(q, w_2) \in \delta(p_1, a)$
                      for some state $p_1\}$,
where $q \in Q$ and $w_1, w_2 \in \Omega^*$.


DEFINITION 2.11.2. [GSM Mapping and $\varepsilon$-free GSM Mapping] Given a
language $L$, subset of $\Sigma^*$, a GSM $M = (Q, \Sigma, \Omega, \delta, q_0, F)$ generates in output the
language $M(L)$, subset of $\Omega^*$, called a GSM mapping, which is defined as follows:


$M(L) = \{w \mid (p,w) \in \delta^*(q_0, x)$ for some state $p \in F$ and for some $x \in L\}$.


A GSM mapping $M(L)$ is said to be $\varepsilon$-free iff the GSM $M$ is $\varepsilon$-free, that is, for every
symbol $a \in \Sigma$ and every state $q \in Q$, if $(p,w) \in \delta(q,a)$ for some state $p \in Q$ and
some $w \in \Omega^*$ then $w \neq \varepsilon$.


Note that in this definition the terminology is somewhat unusual, because a GSM 
mapping is a language, while in the mathematical terminology a mapping is a set 
of pairs. 

The language M (L) is the set of all the output words, each of which is generated 
by M while M nondeterministically accepts one of the words of L. Note that, in 
general, not all words of L are accepted by M. 

Thus, given a language L, a generalized sequential machine M: 

(i) performs on L a filtering operation by selecting the accepted subset of L, and 
(ii) while accepting that subset, M generates the new language M(L) (see Fig- 
ure 2.11.3 on page 94). 

Since a GSM is a nondeterministic automaton, for each accepted word of L more
than one word of M(L) may be generated.
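
The extension of the transition function to words and the contribution of a single
input word to a GSM mapping can be computed directly from the definitions. What
follows is only a sketch under our own assumptions: the class GsmSketch and all its
members are hypothetical names, states are encoded as integers, and the example
machine is merely a guess at a deterministic GSM which translates words of the
form a...ab...b into words of the form 0...010...0.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of a GSM: all names are illustrative.
class GsmSketch {

    // One element (q, w) of delta(p, a).
    static final class Move {
        final int state;
        final String output;
        Move(int state, String output) { this.state = state; this.output = output; }
    }

    private final Map<String, List<Move>> delta = new HashMap<>();
    private final Set<Integer> finalStates = new TreeSet<>();
    private final int initialState;

    GsmSketch(int initialState) { this.initialState = initialState; }

    // Adds (q, w) to delta(p, a).
    void addMove(int p, char a, int q, String w) {
        delta.computeIfAbsent(p + "," + a, k -> new ArrayList<>()).add(new Move(q, w));
    }

    void addFinal(int q) { finalStates.add(q); }

    // delta*(p, x), represented as a map from reached states to sets of output words.
    Map<Integer, Set<String>> deltaStar(int p, String x) {
        Map<Integer, Set<String>> current = new HashMap<>();
        current.computeIfAbsent(p, k -> new TreeSet<>()).add("");  // delta*(p, eps) = {(p, eps)}
        for (char a : x.toCharArray()) {
            Map<Integer, Set<String>> next = new HashMap<>();
            for (Map.Entry<Integer, Set<String>> e : current.entrySet()) {
                List<Move> moves = delta.get(e.getKey() + "," + a);
                if (moves == null) { continue; }                   // delta is a partial function
                for (Move m : moves) {
                    for (String w1 : e.getValue()) {
                        next.computeIfAbsent(m.state, k -> new TreeSet<>()).add(w1 + m.output);
                    }
                }
            }
            current = next;
        }
        return current;
    }

    // The set {w | (p, w) in delta*(q0, x) for some final state p}.
    Set<String> outputsFor(String x) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<Integer, Set<String>> e : deltaStar(initialState, x).entrySet()) {
            if (finalStates.contains(e.getKey())) { result.addAll(e.getValue()); }
        }
        return result;
    }

    // Assumed example machine (not taken from any figure of the text).
    static GsmSketch example() {
        GsmSketch m = new GsmSketch(0);
        m.addMove(0, 'a', 1, "0");
        m.addMove(1, 'a', 1, "0");
        m.addMove(1, 'b', 2, "10");
        m.addMove(2, 'b', 2, "0");
        m.addFinal(2);
        return m;
    }

    public static void main(String[] args) {
        GsmSketch m = example();
        System.out.println(m.outputsFor("aabb"));
        System.out.println(m.outputsFor("ba"));
    }
}
```

The language M(L) is then the union of outputsFor(x) over all words x in L. A word
which is not accepted, such as ba in this sketch, contributes the empty set: this is
the filtering operation mentioned above.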


DEFINITION 2.11.3. [Inverse GSM Mapping] Given a GSM $M = (Q, \Sigma, \Omega, \delta,
q_0, F)$ and a language $L$ subset of $\Omega^*$, the corresponding inverse GSM mapping,
denoted $M^{-1}(L)$, is the language subset of $\Sigma^*$, defined as follows:


$M^{-1}(L) = \{x \mid$ there exists $(p,w)$ s.t. $(p,w) \in \delta^*(q_0,x)$ and $p \in F$ and $w \in L\}$.


Since a GSM $M$ is a nondeterministic machine and defines a binary relation in
$\Sigma^* \times \Omega^*$ (which, in general, is not a bijection from $\Sigma^*$ to $\Omega^*$), it is not always the
case that $M(M^{-1}(L)) = M^{-1}(M(L)) = L$.


Given the languages $L1 = \{a^n b^n \mid n \ge 1\}$ and $L2 = \{0^n 1 0^n \mid n \ge 1\}$, in
Figure 2.11.4 we have depicted the GSM M12 which translates the
language L1 onto the language L2, and the GSM M21 which translates back the
language L2 onto the language L1. We have represented the fact that $(q,w) \in \delta(p, a)$
by drawing an arc from state $p$ to state $q$ labeled by $a/w$.


Note that the GSM M12 and M21 are deterministic, and thus, they determine a 
homomorphism from the domain language to the range language (see Definition 1.7.2 
on page 27). Note also that M12 accepts a language which is a proper superset of L1. 
Analogously, M21 accepts a language which is a proper superset of L2. 
[Figure 2.11.3: the language L on the left and the language M(L) on the right, with
arrows from the accepted words of L to the words of M(L) generated from them.]


FIGURE 2.11.3. The $\varepsilon$-free GSM mapping $M(L)$. The GSM $M$
generates the language $M(L)$ from the language $L$. The word u is not
accepted by the generalized sequential machine $M$. Note that when
the word ug is accepted by $M$, the two words $v_1$ and $v_2$ are generated.
No word exists in $L$ such that while the machine $M$ accepts that word,
the empty word $\varepsilon$ is generated in output.


[Figure 2.11.4: state diagrams of the GSM M12 (left) and of the GSM M21 (right).]


FIGURE 2.11.4. The GSM M12 on the left translates the language
$L1 = \{a^n b^n \mid n \ge 1\}$ onto the language $L2 = \{0^n 1 0^n \mid n \ge 1\}$. The
GSM M21 on the right translates back the language L2 onto L1. The
machine M12 is an $\varepsilon$-free GSM, while M21 is not.


2.12. Closure Properties of Regular Languages 
We have the following results. 


THEOREM 2.12.1. The class of regular languages is closed by definition under: 
(1) concatenation, (2) union, and (3) Kleene star. 


THEOREM 2.12.2. The class of regular languages over the alphabet $\Sigma$ is a Boolean
Algebra in the sense that it is closed under: (1) union, (2) intersection, and
(3) complementation with respect to $\Sigma^*$.


Let us now introduce the following definition. 
DEFINITION 2.12.3. [Reversal of a Language] The reversal of a language
$L \subseteq \Sigma^*$, denoted $rev(L)$, is the set $\{rev(w) \mid w \in L\}$, where:


     $rev(\varepsilon) = \varepsilon$
     $rev(aw) = rev(w)\,a$     for any $a \in \Sigma$ and any $w \in \Sigma^*$.
Thus, $rev(L)$ consists of all words in $L$ with their symbols occurring in the reverse
order. We say that $rev(w)$ is the reversal of the word $w$.
In what follows, for all words $w$, we will feel free to write $w^R$, instead of $rev(w)$,
and analogously, for all languages $L$, we will feel free to write $L^R$, instead of $rev(L)$.
For instance, if $w = abac$ then $w^R = caba$.
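
The recursive definition of the reversal can be transcribed almost literally into a
program. The following sketch is ours: the class ReversalSketch and its methods are
hypothetical names, words are represented as Java strings, and languages as finite
sets of strings (so that only finite languages can be handled).

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: reversal of words and of finite languages.
class ReversalSketch {

    // rev(eps) = eps and rev(a w) = rev(w) a.
    static String rev(String w) {
        if (w.isEmpty()) {
            return "";
        }
        return rev(w.substring(1)) + w.charAt(0);
    }

    // rev(L) = {rev(w) | w in L}, for a finite language L.
    static Set<String> revLanguage(Set<String> language) {
        Set<String> result = new TreeSet<String>();
        for (String w : language) {
            result.add(rev(w));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(rev("abac"));
    }
}
```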


We have the following closure result. 


THEOREM 2.12.4. The class of regular languages is closed under reversal. 


The class of regular languages is also closed under: (1) ($\varepsilon$-free or not $\varepsilon$-free) GSM
mapping, and (2) inverse GSM mapping. The proof of these properties is based on
the fact that GSM mappings can be expressed in terms of homomorphisms, inverse
homomorphisms, and intersections with regular sets (see [9, Chapter 11]).


THEOREM 2.12.5. The class of regular languages is closed under: (i) substitution
(of symbols by regular languages), (ii) ($\varepsilon$-free or not $\varepsilon$-free) homomorphism, and
(iii) inverse ($\varepsilon$-free or not $\varepsilon$-free) homomorphism.


PROOF. Properties (i) and (ii) are based on the representation of a regular
language via a regular expression. The substitution determines the replacement of
a symbol in a given regular expression by a regular expression. (iii) Let $h$ be
a homomorphism from $\Sigma$ to $\Omega^*$. We construct a finite automaton M1 accepting
$h^{-1}(V)$ for any given regular language $V \subseteq \Omega^*$ accepted by the automaton
$M2 = (Q, \Omega, \delta_2, q_0, F)$ by defining M1 to be $(Q, \Sigma, \delta_1, q_0, F)$, where for any state
$q \in Q$ and symbol $a \in \Sigma$, the value of $\delta_1(q,a)$ is equal to $\delta_2^*(q,h(a))$, where the
function $\delta_2^* : Q \times \Omega^* \rightarrow Q$ is the usual extension which acts on words, of the
transition function $\delta_2 : Q \times \Omega \rightarrow Q$ which acts on symbols (see Section 2.1 starting on
page 29). Indeed, we have to consider $\delta_2^*$, rather than $\delta_2$, because for some $a \in \Sigma$,
the length of $h(a)$ may be different from 1. In Figure 2.12.1 on page 96 we show the
automaton M1 which, given
- the set $V = \{b^n \mid n \ge 2\}$ of words accepted by the automaton M2, and
- the homomorphism $h$ such that $h(a) = bb$,
accepts the set $h^{-1}(V) = \{a^n \mid n \ge 1\}$.
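
The construction of M1 from M2 in the proof above can be sketched as follows.
This is only an illustration under our own assumptions: the class InverseHomSketch
and its methods are hypothetical names, and for M2 we have assumed a three-state
automaton with states 0, 1, and 2, where 2 is the only final state, which accepts
the words made out of at least two b's (this may differ from the automaton drawn
in Figure 2.12.1).

```java
// Hypothetical sketch of the construction of M1 from M2 and h.
class InverseHomSketch {

    // Transition function delta2 of the assumed automaton M2 on the states {0,1,2}:
    // state 2 is final and it is reached after reading at least two b's.
    static int delta2(int q, char c) {
        if (c != 'b') {
            throw new IllegalArgumentException("symbol not in Omega: " + c);
        }
        return Math.min(q + 1, 2);
    }

    // The usual extension of delta2 to words.
    static int delta2Star(int q, String w) {
        for (char c : w.toCharArray()) {
            q = delta2(q, c);
        }
        return q;
    }

    // The homomorphism h such that h(a) = bb.
    static String h(char a) {
        if (a != 'a') {
            throw new IllegalArgumentException("symbol not in Sigma: " + a);
        }
        return "bb";
    }

    // Transition function of M1: delta1(q, a) = delta2*(q, h(a)).
    static int delta1(int q, char a) {
        return delta2Star(q, h(a));
    }

    static boolean acceptsM2(String w) {
        return delta2Star(0, w) == 2;
    }

    static boolean acceptsM1(String x) {
        int q = 0;
        for (char a : x.toCharArray()) {
            q = delta1(q, a);
        }
        return q == 2;
    }

    public static void main(String[] args) {
        String x = "";
        for (int n = 0; n <= 3; n++) {
            System.out.println("a^" + n + " accepted by M1: " + acceptsM1(x));
            x = x + "a";
        }
    }
}
```

In this sketch M1 and M2 share the same states, the same initial state, and the
same final state: only the transition function changes, as in the proof.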


[Figure 2.12.1: state diagrams of the automaton M1, with $h^{-1}(V) = \{a^n \mid n \ge 1\}$, and
of the automaton M2, with $V = \{b^n \mid n \ge 2\}$, related by $h(a) = bb$.]


FIGURE 2.12.1. The automaton M2 accepts the set $V = \{b^n \mid n \ge 2\}$
of words. Given the homomorphism $h$ such that $h(a) = bb$, the
automaton M1 accepts the set $h^{-1}(V) = \{a^n \mid n \ge 1\}$.


We can use homomorphisms for showing that a given language is not regular as we
now indicate.

Suppose that we know that the language

     $L = \{a^n b^n \mid n \ge 1\}$
is not regular. Then we can show that also the language

     $N = \{0^{2n+1} 1^n \mid n \ge 1\}$
is not regular. Indeed, let us consider the following homomorphisms $f$ from $\{a, b, c\}$
to $\{0,1\}^*$ and $g$ from $\{a,b,c\}$ to $\{a, b\}^*$:


     $f(a) = 00$          $g(a) = a$
     $f(b) = 1$           $g(b) = b$
     $f(c) = 0$           $g(c) = \varepsilon$


We have that: $g(f^{-1}(N) \cap a^+ c\, b^+) = \{a^n b^n \mid n \ge 1\}$. If $N$ were regular then, since
regular languages are closed under homomorphism, inverse homomorphism, and
intersection, also the language $\{a^n b^n \mid n \ge 1\}$ would be regular, and this is not the
case.
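
The equality used in this argument can be checked experimentally on words of
bounded length. The following sketch is ours: the class HomomorphismArgumentSketch
and its methods are hypothetical names, and N is taken to be the set of the words
made out of 2n+1 occurrences of 0 followed by n occurrences of 1, for n >= 1. The
program enumerates the words of the form a...acb...b (with at most a given number
of a's and b's), keeps those whose image under f is in N, and applies g to them.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: a bounded check of g(f^-1(N) intersected with a+ c b+).
class HomomorphismArgumentSketch {

    static String repeat(char c, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) { sb.append(c); }
        return sb.toString();
    }

    // f(a) = 00, f(b) = 1, f(c) = 0, extended to words.
    static String f(String w) {
        StringBuilder sb = new StringBuilder();
        for (char x : w.toCharArray()) {
            sb.append(x == 'a' ? "00" : (x == 'b' ? "1" : "0"));
        }
        return sb.toString();
    }

    // g(a) = a, g(b) = b, g(c) = eps, extended to words.
    static String g(String w) {
        StringBuilder sb = new StringBuilder();
        for (char x : w.toCharArray()) {
            if (x != 'c') { sb.append(x); }
        }
        return sb.toString();
    }

    // Membership in N: 2n+1 zeros followed by n ones, with n >= 1.
    static boolean inN(String w) {
        int zeros = 0;
        while (zeros < w.length() && w.charAt(zeros) == '0') { zeros++; }
        int ones = w.length() - zeros;
        for (int i = zeros; i < w.length(); i++) {
            if (w.charAt(i) != '1') { return false; }
        }
        return ones >= 1 && zeros == 2 * ones + 1;
    }

    // Images under g of the words a^i c b^j (1 <= i,j <= bound) whose image under f is in N.
    static Set<String> image(int bound) {
        Set<String> result = new TreeSet<String>();
        for (int i = 1; i <= bound; i++) {
            for (int j = 1; j <= bound; j++) {
                String w = repeat('a', i) + "c" + repeat('b', j);
                if (inN(f(w))) { result.add(g(w)); }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(image(3));
    }
}
```

Such a bounded check is not a proof: it only illustrates why the intersection with
the regular set of words of the form a...acb...b is needed to select, among the
words mapped by f into N, those of a known shape.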


2.13. Decidability Properties of Regular Languages 


We state without proof the following decidability results. The reader who is not 
familiar with the concept of decidable and undecidable properties (or problems) 
may refer to Chapter 6. 

For any given regular language L, 
(i) it is decidable whether or not L is empty, and 
(ii) it is decidable whether or not L is finite. 


As a consequence of (ii), we have that for any given regular language L, it is 
decidable whether or not L is infinite. 
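
As an illustration of the first of these results, the emptiness of the language
accepted by a deterministic finite automaton can be decided by checking whether or
not a final state is reachable from the initial state. The following sketch is ours
and is not taken from the text: the class DfaEmptinessSketch and its methods are
hypothetical names, states are the integers 0,...,n-1, and next[q][c] is the state
reached from state q on the c-th input symbol (or -1 if no transition is defined).

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: deciding emptiness of the language accepted by a DFA.
class DfaEmptinessSketch {

    // Marks the states which are reachable from the state start.
    static boolean[] reachableFrom(int[][] next, int start) {
        boolean[] seen = new boolean[next.length];
        Deque<Integer> todo = new ArrayDeque<Integer>();
        seen[start] = true;
        todo.push(start);
        while (!todo.isEmpty()) {
            int q = todo.pop();
            for (int r : next[q]) {
                if (r >= 0 && !seen[r]) {
                    seen[r] = true;
                    todo.push(r);
                }
            }
        }
        return seen;
    }

    // The accepted language is empty iff no final state is reachable from q0.
    static boolean isEmpty(int[][] next, int q0, boolean[] isFinal) {
        boolean[] seen = reachableFrom(next, q0);
        for (int q = 0; q < next.length; q++) {
            if (seen[q] && isFinal[q]) { return false; }
        }
        return true;
    }

    public static void main(String[] args) {
        // An automaton over the one-symbol alphabet {b} accepting the words with at least two b's.
        int[][] next = { {1}, {2}, {2} };
        boolean[] isFinal = { false, false, true };
        System.out.println(isEmpty(next, 0, isFinal));
    }
}
```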


For any given two regular languages L1 and L2, it is decidable whether or not
L1 = L2. This result is based on the fact that given a regular language L, the finite
automaton M which accepts L and has the minimum number of states, is unique
up to isomorphism (see Definition 2.7.4 on page 59). Thus, L1 = L2 iff the minimal
finite automata M1 and M2 which accept the languages L1 and L2 respectively, are
isomorphic.

For any regular grammar G, it is decidable whether or not G is ambiguous, that is, 
whether or not there exists a word w of the language L(G) generated by that gram- 
mar G such that w has at least two different derivations (see also Definition 3.12.1 
on page 155). This decidability result is based on the following facts: 


(i) we may assume, without loss of generality, that the given regular grammar has
productions of the form: $A \rightarrow a$ or $A \rightarrow aB$,


(ii) we may consider the finite automaton corresponding to the given regular gram- 
mar, and 


(iii) we may generate, using the Powerset Construction (see Algorithm 2.3.11 on 
page 39), the graph of the states which are reachable from the initial state. If in 
that graph there is a path from the initial state to a final state which goes through 
a vertex with at least two states, then the given grammar is ambiguous. That 
Powerset Construction gives us the word u of the language generated by the given 
grammar such that u has at least two different derivations. 


The following example will clarify the reader’s ideas. 

EXAMPLE 2.13.1. Let us consider the grammar with the following productions: 
S cas 

bA 

aB 

b 

bA 

B b 


The state S is the initial state, and the state A is the only final state. The graph of
the reachable states is depicted in Figure 2.13.1 (see page 98), where the final states
are denoted by double circles. Since the state {S,B} has cardinality 2, we may get
from {S} to {S,B} and then to the final state {A} in two different ways. Thus,
the word ab has the following two derivations:


(i) $S \rightarrow aS \rightarrow ab$
(ii) $S \rightarrow aB \rightarrow ab$




FACT 2.13.2. For any regular grammar $G_1$ it is possible to derive a regular
grammar $G_2$ such that: (i) the language $L(G_1)$ is equal to the language $L(G_2)$, and
(ii) $G_2$ is not an ambiguous grammar.


PROOF. It is enough to construct a deterministic finite automaton which is
equivalent to the given grammar $G_1$. This is a simple application of the Powerset
Construction Procedure (see Algorithm 2.3.11 on page 39).
[Figure 2.13.1: state diagrams of the automaton $F_1$ and of the automaton $F_2$.]


FIGURE 2.13.1. A nondeterministic finite automaton $F_1$ and the
equivalent deterministic finite automaton $F_2$ obtained by the Powerset
Construction.


CHAPTER 3 


Pushdown Automata and Context-Free Grammars 


In this chapter we will study the class of pushdown automata and their relation 
to the class of context-free grammars and languages. We will also consider various 
transformations and simplifications of context-free grammars and we will show how 
to derive the Chomsky normal form and the Greibach normal form of context- 
free grammars. We will then study some fundamental properties of context-free 
languages and we will present a few basic decidability and undecidability results. 
We will also consider the deterministic pushdown automata and the deterministic 
context-free languages and we will present two parsing algorithms for context-free 
languages. 


3.1. Pushdown Automata and Context-Free Languages 


A pushdown automaton is a nondeterministic machine which consists of: 

(i) a finite automaton, 

(ii) a stack (also called a pushdown), and 

(iii) an input tape, where the input string is placed. 
The input string can be read one symbol at a time by an input head which can move 
on the input tape from left to right only. At any instant in time the input head is 
placed on a particular cell of the input tape and reads the symbol written on that 
cell (see Figure 3.1.1). 

The following definition introduces the formal notion of a nondeterministic push- 
down automaton. 


DEFINITION 3.1.1. [Nondeterministic Pushdown Automaton] A nondeterministic
pushdown automaton (also called pushdown automaton, or pda, for short)
M over the input alphabet Σ is a septuple of the form (Q, Σ, Γ, q0, Z0, F, δ) where:

- Q is a finite set of states,

- Γ is the stack alphabet, also called the pushdown alphabet,

- q0 is an element of Q, called the initial state,

- Z0 is an element of Γ which is initially placed at the bottom of the stack and
may occur on the stack at the bottom position only,

- F ⊆ Q is the set of final states, and

- δ is a total function, called the transition function, from Q × (Σ ∪ {ε}) × Γ to the
set of the finite subsets of Q × Γ*.


In what follows, when referring to pda’s we will feel free to say ‘pushdown’, instead
of ‘stack’, and we will feel free to write ‘PDA’, instead of ‘pda’.


FIGURE 3.1.1. A nondeterministic pushdown automaton with the input
string α = w1...wn ∈ Σ*. The finite automaton has states Q, initial state q0,
and final states F. The input head moves from left to right or it does not move.
The stack is assumed to grow to the left, and if we push on the stack the
string Z1...Zn, the new top of the stack is Z1.


As in the case of finite automata (see Definition 2.1.4 on page 30), also pushdown 
automata may behave as acceptors of the input strings which are initially placed on 
the input tape (see Definition 3.1.7 on page 102 and Definition 3.1.8 on page 102). 

Given a string of Σ*, called the input string, on the input tape, the transition
function δ of a pushdown automaton is defined by the following two sequences S1
and S2 of actions, where by ‘or’ we mean the nondeterministic choice.


(S1) For every q ∈ Q, a ∈ Σ, Z ∈ Γ, we stipulate that δ(q, a, Z) = {(q1, γ1), ...,
(qn, γn)} iff in state q the pda reads the symbol a ∈ Σ from the input tape, moves
the input head to the right, and
- replaces the symbol Z on the top of the stack by the string γ1 and makes a transition
to state q1,

or ... or
- replaces the symbol Z on the top of the stack by the string γn and makes a
transition to state qn.


(S2) For every q ∈ Q, Z ∈ Γ, we stipulate that δ(q, ε, Z) = {(q1, γ1), ..., (qn, γn)}
iff in state q the pda does not move the input head to the right, and
- replaces the symbol Z on the top of the stack by the string γ1 and makes a transition
to state q1,

or ... or
- replaces the symbol Z on the top of the stack by the string γn and makes a
transition to state qn.

Note that the transition function δ is not defined when the pushdown is empty,
because the third argument of δ should be an element of Γ. (When the stack is
empty we could assume that the third argument of δ is ε, but in fact, ε is not an
element of Γ.) If the pushdown is empty, the automaton cannot make any move
and, so to speak, it stops in the current state.

Note also that when defining the transition function δ, one should specify the
order in which any of the strings γ1, ..., γn is pushed onto the stack, one symbol per
cell. In particular, one should indicate whether the leftmost symbol or the rightmost
symbol of the strings γ1, ..., γn will become, after the push operation, the new top of
the stack. Recall that we have assumed that pushing the string γ = Z1 ... Z_{n-1} Zn
onto the stack means pushing Zn, then Z_{n-1}, and eventually Z1, and thus, we have
assumed that the new top of the stack is Z1. This assumption is independent of the
way in which we draw the stack in the figures below. Indeed, we may draw a stack
which grows either ‘to the left’ or ‘to the right’. In Figure 3.1.1 we have assumed
that the stack grows to the left. Note also that the issue of the order in which any
of the strings γ1, ..., γn is pushed onto the stack can also be solved as suggested by
Fact 3.1.12. Indeed, by that fact we may assume, without loss of generality, that the
strings γ1, ..., γn consist of one symbol only and so the order in which the symbols
of the strings should be pushed onto the stack becomes irrelevant.


REMARK 3.1.2. When we say ‘pda’ without any qualification we mean a non- 
deterministic pushdown automaton, while when we say ‘finite automaton’ without 
any qualification we mean a deterministic finite automaton. 


DEFINITION 3.1.3. [Configuration of a PDA] A configuration of a pda M =
(Q, Σ, Γ, q0, Z0, F, δ) is a triple (q, α, γ), where:
(i) q ∈ Q,
(ii) α ∈ Σ* is the string of symbols which remain to be read on the input tape (from
left to right), that is, if the input string is w1...wn and the input head is on the
symbol wk, for some k such that 1≤k≤n, then α is the substring wk...wn, and
(iii) γ ∈ Γ* is a string of symbols on the stack where we assume that the top-to-bottom
order of the symbols on the stack corresponds to the left-to-right order of
the symbols in γ. We denote by C_M the set of configurations of the pda M.


We also introduce the following notions. 


DEFINITION 3.1.4. [Initial Configuration, Final Configuration by final
state, and Final Configuration by empty stack] Given a pushdown automaton
M = (Q, Σ, Γ, q0, Z0, F, δ), a triple of the form (q0, α, Z0) for some input string
α ∈ Σ*, is said to be an initial configuration.


The set of the final configurations ‘by final state’ of the pda M is
Fin^f_M = {(q, ε, γ) | q ∈ F and γ ∈ Γ*}.

The set of the final configurations ‘by empty stack’ of the pda M is
Fin^e_M = {(q, ε, ε) | q ∈ Q}.

Given a pda, now we define its move relation. 


DEFINITION 3.1.5. [Move (or Transition) and Epsilon Move (or Epsilon
Transition) of a PDA] Given a pda M = (Q, Σ, Γ, q0, Z0, F, δ), its move relation
(or transition relation), denoted →_M, is a subset of C_M × C_M, where C_M is the set
of configurations of M, such that for any p, q ∈ Q, a ∈ Σ, Z ∈ Γ, α ∈ Σ*, and
β, γ ∈ Γ*,

either if (q, γ) ∈ δ(p, a, Z) then (p, aα, Zβ) →_M (q, α, γβ)

or if (q, γ) ∈ δ(p, ε, Z) then (p, α, Zβ) →_M (q, α, γβ)
(this second kind of move is called an epsilon move or an epsilon transition).


Instead of writing ‘epsilon move’ or ‘epsilon transition’, we will feel free to write
‘ε-move’ or ‘ε-transition’, respectively.

When representing the move relation →_M we have assumed that the top of the
stack is ‘to the left’ as depicted in Figure 3.1.1: this is why in Definition 3.1.5 we
have written Zβ, instead of βZ, and γβ, instead of βγ. Note that in every move
the top symbol Z of the stack is always popped from the stack and then the string
γ is pushed onto the stack.
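
The move relation can be made concrete with a short program. The following Python sketch is only an illustration under assumptions of ours: a configuration is a triple (state, remaining input, stack), the stack is a tuple whose first element is the top, the transition function is a dictionary from triples (state, symbol, top) to sets of pairs (state, pushed string), and the empty string "" stands for ε. The function name `moves` and the two-instruction transition function are hypothetical and do not come from the text.

```python
def moves(delta, config):
    """Return the set of configurations reachable from `config` in one move.

    Assumed encoding: config = (state, remaining_input, stack), where stack
    is a tuple whose element 0 is the top, and "" plays the role of epsilon.
    """
    state, rest, stack = config
    if not stack:
        # The transition function is not defined on an empty stack.
        return set()
    top, below = stack[0], stack[1:]
    result = set()
    # Epsilon moves: the input head does not move.
    for new_state, gamma in delta.get((state, "", top), set()):
        result.add((new_state, rest, tuple(gamma) + below))
    # Moves that read the next input symbol and advance the input head.
    if rest:
        for new_state, gamma in delta.get((state, rest[0], top), set()):
            result.add((new_state, rest[1:], tuple(gamma) + below))
    return result


# A two-instruction transition function, chosen only for illustration.
delta = {
    ("p", "a", "Z"): {("q", ("A", "Z"))},   # p a Z -> push A Z goto q
    ("p", "", "Z"): {("p", ())},            # p eps Z -> push eps goto p
}
successors = moves(delta, ("p", "ab", ("Z", "B")))
```

In both kinds of move the top symbol Z is removed and the pushed string is placed in front of the rest of the stack, which mirrors the passage from Zβ to γβ in Definition 3.1.5.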

If two configurations C1 and C2 are in the move relation, that is, C1 →_M C2, we
say that the pda M makes a move (or a transition) from the configuration C1 to the
configuration C2, and we also say that there is a move from C1 to C2.

We denote by →*_M the reflexive, transitive closure of →_M.


DEFINITION 3.1.6. [Instructions (or Quintuples) of a PDA] Given a pda
M = (Q, Σ, Γ, q0, Z0, F, δ), for any p, q ∈ Q, any x ∈ Σ ∪ {ε}, any Z ∈ Γ, and
any γ ∈ Γ*, if (p, γ) ∈ δ(q, x, Z) we say that (q, x, Z, p, γ) is an instruction (or a
quintuple) of the pda. An instruction (q, x, Z, p, γ) is also written as follows:

q x Z ↦ push γ goto p


When we represent the transition function of a pda as a sequence of instructions, we
assume that δ(q, x, Z) = {} if in that sequence there is no instruction of the form
q x Z ↦ push γ goto p, for some γ ∈ Γ* and p ∈ Q.


DEFINITION 3.1.7. [Language Accepted by a PDA by final state] An input
string w is accepted by a pda M by final state iff there exists a configuration
C ∈ Fin^f_M such that (q0, w, Z0) →*_M C. The language accepted by a pda M by final state,
denoted L(M), is the set of all words accepted by the pda M by final state.


Note that after accepting a string by final state, the pushdown automaton may
continue to make a finite or an infinite number of moves according to its transition
function δ, and these moves may go through final and/or non-final states.


DEFINITION 3.1.8. [Language Accepted by a PDA by empty stack] An
input string w is accepted by a pda M by empty stack iff there exists a configuration
C ∈ Fin^e_M such that (q0, w, Z0) →*_M C. The language accepted by a pda M by empty
stack, denoted N(M), is the set of all words accepted by the pda M by empty stack.


Note that after accepting a string by empty stack, the transition function δ is
not defined and the pushdown automaton cannot make any move.

Note also that, with reference to the above Definitions 3.1.7 and 3.1.8, other
textbooks use the terms ‘recognized string’ or ‘recognized language’, instead of the
terms ‘accepted string’ or ‘accepted language’, respectively.


REMARK 3.1.9. [Input String Completely Read] When an input string is
accepted, either by final state or by empty stack, that input string should be completely
read, that is, before acceptance either the input string is empty or there
should be a move in which the transition function δ takes as an input argument the
rightmost character of the input string. On the contrary, if the transition function δ
has not yet taken as an input argument the rightmost character of the input string,
we will say that the input string has not been completely read.


THEOREM 3.1.10. [Equivalence of Acceptance by final state and by empty
stack for Nondeterministic PDA’s] (i) For every pda M which accepts by final
state a language A, there exists a pda M′ which accepts by empty stack the same
language A, that is, L(M) = N(M′). (ii) For every pda M which accepts by empty
stack a language A, there exists a pda M′ which accepts by final state the same
language A, that is, N(M) = L(M′).


PROOF. (i) Given a pda M = (Q, Σ, Γ, q0, Z0, F, δ) and the language
L(M) it accepts by final state, we construct a pda M′ such that L(M) = N(M′) as
follows. We take M′ to be the septuple (Q′, Σ, Γ′, q0′, Z0, F, δ′), where Q′ = Q ∪
{q0′, qe} and q0′ and qe are two new, additional states not in Q. The state q0′ is the
initial, non-final state of M′ and qe is a non-final state. We also consider a new,
additional stack symbol $ not in Γ, that is, Γ′ = Γ ∪ {$}. The transition function
δ′ is obtained from δ by adding to δ the following instructions (we assume that the
top of the stack is ‘to the left’, that is, when we push Z0 $ then the new top is Z0):


(1) q0′ ε Z0 ↦ push Z0 $ goto q0

(2) for each final state q ∈ F of the pda M, and for each Z ∈ Γ ∪ {$},
q ε Z ↦ push ε goto qe

(3) for each Z ∈ Γ ∪ {$},
qe ε Z ↦ push ε goto qe.


The new symbol $ is a marker placed at the bottom of the stack of the pda M′.
That marker is necessary because, otherwise, M′ may accept a word because its
stack is empty, while for the same input word, M stops because its stack is empty
and it is not in a final state (thus, M does not accept the input word). Indeed, let
us consider the case where the pda M, reading the last input character, say a, of an
input word w, (1) makes a transition to a non-final state from which no transitions
are possible, and (2) by making that transition, it leaves the stack empty. Thus, M
does not accept w. In that case, if $ were not on the stack, the pda M′, when reading
that character a, would also leave the stack empty and, thus, M′ would accept the word w.
We leave it to the reader to convince himself that L(M) = N(M′).

(ii) Given a pda M = (Q, Σ, Γ, q0, Z0, ∅, δ) and the language N(M) it accepts by
empty stack, we construct a pda M′ such that L(M′) = N(M) as follows. We take
M′ to be the septuple (Q ∪ {q0′, qf}, Σ, Γ ∪ {$}, q0′, $, {qf}, δ′), where q0′ and qf are
two new states, and $ is a new stack symbol. The transition function δ′ is obtained
from δ by adding to δ the following instructions (we assume that the top of the stack
is ‘to the left’, and thus, for instance, if we push Z0 $ onto the stack then the new
top symbol is Z0):

(1) q0′ ε $ ↦ push Z0 $ goto q0

(2) for each q ∈ Q,
q ε $ ↦ push ε goto qf.

Instruction (1) causes M′ to simulate the initial configuration of M, but the new
symbol $ is placed at the bottom of the stack of the pda M′. If M erases its
entire stack, then M′ erases its entire stack with the exception of the symbol $.
Instructions (2) cause M′ to make a transition to its unique final state qf. We leave
it to the reader to convince himself that N(M) = L(M′).
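
The two constructions of this proof are mechanical, and it may help to see them as operations on a table of instructions. The following Python sketch is an illustration under our own assumptions: δ is a dictionary from triples (state, symbol, top) to sets of pairs (state, pushed tuple), the empty string "" stands for ε, and the names "q0'", "qe", "qf" and "$" are assumed not to occur in the given pda. The function names are hypothetical.

```python
def add_instruction(delta, state, symbol, top, new_state, gamma):
    """Add the instruction `state symbol top -> push gamma goto new_state`."""
    delta.setdefault((state, symbol, top), set()).add((new_state, tuple(gamma)))


def final_state_to_empty_stack(states, stack_alphabet, delta, q0, Z0, final):
    """Point (i): build M' such that N(M') = L(M).

    Returns (states', stack_alphabet', delta', initial_state', Z0).
    """
    new_delta = {key: set(value) for key, value in delta.items()}
    new_stack_alphabet = set(stack_alphabet) | {"$"}
    add_instruction(new_delta, "q0'", "", Z0, q0, (Z0, "$"))       # (1)
    for Z in new_stack_alphabet:
        for q in final:
            add_instruction(new_delta, q, "", Z, "qe", ())          # (2)
        add_instruction(new_delta, "qe", "", Z, "qe", ())           # (3)
    return set(states) | {"q0'", "qe"}, new_stack_alphabet, new_delta, "q0'", Z0


def empty_stack_to_final_state(states, stack_alphabet, delta, q0, Z0):
    """Point (ii): build M' such that L(M') = N(M).

    Returns (states', stack_alphabet', delta', initial_state', bottom, final').
    """
    new_delta = {key: set(value) for key, value in delta.items()}
    add_instruction(new_delta, "q0'", "", "$", q0, (Z0, "$"))      # (1)
    for q in states:
        add_instruction(new_delta, q, "", "$", "qf", ())            # (2)
    return (set(states) | {"q0'", "qf"}, set(stack_alphabet) | {"$"},
            new_delta, "q0'", "$", {"qf"})


# A one-instruction pda, used only to exercise the two functions.
delta = {("p", "a", "Z0"): {("r", ("Z0",))}}
by_empty_stack = final_state_to_empty_stack({"p", "r"}, {"Z0"}, delta, "p", "Z0", {"r"})
by_final_state = empty_stack_to_final_state({"p", "r"}, {"Z0"}, delta, "p", "Z0")
```

Both functions copy the given table before extending it, so the original pda is left unchanged; this is a design choice of the sketch, not something required by the proof.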


We have the following facts. 


FACT 3.1.11. [Restricted PDA’s with Acceptance by final state. (1)] For
any nondeterministic pda which accepts a language L by final state there exists an
equivalent nondeterministic pda which (i) accepts L by final state, (ii) has at most
two states, and (iii) makes no ε-moves on the input [9, page 120].


FACT 3.1.12. [Restricted PDA’s with Acceptance by final state. (2)] For
any nondeterministic pda which accepts by final state, there exists an equivalent
nondeterministic pda which accepts by final state, such that at each move:

- either (1.1) it reads one symbol of the input, or (1.2) it makes an ε-move on the
input, and

- either (2.1) it pops one symbol off the stack, or (2.2) it pushes one symbol on the
stack, or (2.3) it does not change the symbol on the top of the stack [9, page 121].


FACT 3.1.13. [Restricted PDA’s with Acceptance by empty stack] For
any nondeterministic pda which accepts a language L by empty stack, there exists
an equivalent nondeterministic pda which (i) accepts L by empty stack, and (ii) if
ε ∈ L then it makes one ε-move on the input (this ε-move is necessary to erase the
symbol Z0 from the stack) else it makes no ε-moves on the input [8, page 159].


In the following theorem we show that there is a correspondence between the set
of the S-extended type 2 grammars whose set of terminal symbols is Σ, and the set
of the nondeterministic pushdown automata over the input alphabet Σ.


THEOREM 3.1.14. [Equivalence Between Nondeterministic PDA’s and
S-extended Type 2 Grammars] (i) For every S-extended type 2 grammar which
generates the language L ⊆ Σ*, there exists a pushdown automaton over the input
alphabet Σ which accepts L by empty stack, and (ii) vice versa.


PROOF. Let us show Point (i). Given a context-free grammar G = (V_T, V_N, P, S),
the nondeterministic pushdown automaton which accepts by empty stack the
language L(G) generated by G, is the septuple of Figure 3.1.2 where δ is defined as
indicated in that figure. Note that, similarly to the case of finite automata (see
page 31), if we want to get a transition function δ which is a total function, it may
be necessary: (i) to add to that pda a non-final sink state qs ∈ Q−F, and (ii) to
consider some additional instructions, besides those listed in Figure 3.1.2, each of
which is of the form:

either qi a Z ↦ push β goto qs   for some qi ∈ Q, a ∈ Σ ∪ {ε}, Z ∈ Γ, and β ∈ Γ*
or     qs a Z ↦ push β goto qs   for some a ∈ Σ ∪ {ε}, Z ∈ Γ, and β ∈ Γ*.


pda (acceptance by empty stack):

({q0, q1}, V_T, V_N ∪ V_T ∪ {Z0}, q0, Z0, ∅, δ)

where {q0, q1} is the set Q of states, V_T is the set Σ of input symbols,
V_N ∪ V_T ∪ {Z0} is the set Γ of stack symbols, q0 is the initial state, Z0 is the
symbol at the bottom of the stack, ∅ is the set F of final states, and δ is the
transition function given by the following instructions:

(1) q0 ε Z0 ↦ push S Z0     goto q1   (initialization)
(2) q1 ε A  ↦ push Z1...Zk  goto q1   for each A → Z1...Zk in P
(3) q1 a a  ↦ push ε        goto q1   for each a in V_T
(4) q1 ε Z0 ↦ push ε        goto q1


FIGURE 3.1.2. Above: the pda of Point (i) of the proof of Theorem 3.1.14
on page 104. It accepts by empty stack the language generated
by the S-extended context-free grammar (V_T, V_N, P, S). Below:
the transition function δ of that pda. In the instructions of type
(2) the string Z1...Zk may also be the empty string ε. For the notion
of acceptance by final state, see Remark 3.1.16 on page 106.


Note that the pushdown automaton of Figure 3.1.2 is nondeterministic because we
may have more than one instruction of type (2) (see Figure 3.1.2) for the same
nonterminal A.

The reader may convince himself that given any context-free grammar G, the 
pda defined as we have indicated above, accepts by empty stack the language L(G). 
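
The construction of Figure 3.1.2 can be sketched in Python as follows. The encoding is an assumption of ours: a grammar is a dictionary from nonterminals to lists of right-hand sides (tuples of symbols, with the empty tuple for ε), the transition function is a dictionary from triples (state, symbol, top) to sets of pairs (state, pushed tuple), and "" stands for ε. The function name `grammar_to_pda` is hypothetical; the grammar used to try it out is the one of Example 3.2.1.

```python
def grammar_to_pda(terminals, productions, start="S"):
    """Build the transition function of the pda of Figure 3.1.2.

    The state names "q0", "q1" and the bottom symbol "Z0" follow the figure.
    """
    # Type (1): initialization.
    delta = {("q0", "", "Z0"): {("q1", (start, "Z0"))}}
    # Type (2): replace a nonterminal on the top by one of its right-hand sides.
    for A, right_hand_sides in productions.items():
        delta[("q1", "", A)] = {("q1", tuple(rhs)) for rhs in right_hand_sides}
    # Type (3): match a terminal on the top against the input symbol read.
    for a in terminals:
        delta[("q1", a, a)] = {("q1", ())}
    # Type (4): erase the bottom symbol, leaving the stack empty.
    delta[("q1", "", "Z0")] = {("q1", ())}
    return delta


# Grammar with productions S -> a S b | c | epsilon.
productions = {"S": [("a", "S", "b"), ("c",), ()]}
delta = grammar_to_pda({"a", "b", "c"}, productions)
```

The three alternatives collected under the key ("q1", "", "S") are exactly the source of nondeterminism mentioned above.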

Let us show Point (ii). Given a pda M = (Q, Σ, Γ, q0, Z0, F, δ) which accepts
by empty stack the language N(M), the context-free grammar G = (V_T, V_N, P, S)
which generates the language L(G) = N(M) is defined as follows:


V_N = {S} ∪ {[q Z q′] | for each q, q′ ∈ Q and each Z ∈ Γ}
V_T = Σ
together with the following set P of productions:
(ii.1) for each q ∈ Q, S → [q0 Z0 q], and
(ii.2) for each q, q1, ..., q_{m+1} ∈ Q, for each a ∈ Σ ∪ {ε}, for each A, B1, ..., Bm ∈ Γ,
for each (q1, B1 B2 ... Bm) ∈ δ(q, a, A),


[q A q_{m+1}] → a [q1 B1 q2] [q2 B2 q3] ... [qm Bm q_{m+1}]

In particular, if m = 0, that is, (q1, ε) ∈ δ(q, a, A), for some q, q1 ∈ Q, a ∈ Σ ∪ {ε},
and A ∈ Γ, then the production to be inserted into the set P is: [q A q1] → a.
Since the pushdown automaton accepts by empty stack, without loss of generality,
we may assume that the set F of the final states is empty.
Note that when a leftmost derivation for the grammar G has generated the word:


x [q1 Z1 q2] [q2 Z2 q3] ... [qk Zk q_{k+1}]
- the pda has read the initial substring x ∈ V_T* from the input tape,
- the pda is in state q1,
- the stack of the pda holds Z1 Z2 ... Zk and the new top of the stack is Z1, and
- it is guessed that the pda will be in state q2 after popping Z1 and ... and in state
q_{k+1} after popping Zk.

We have that the leftmost derivations of the grammar G simulate the moves
of M. The formal proof of this fact can be done in the following two steps (the
details are left to the reader).

Step (1). We first prove by induction on the number of moves of M that: for all
states q, p ∈ Q, for all symbols A ∈ Γ, and for all sentential forms x which are
generated from the start symbol according to the grammar G,


[q A p] →*_G x iff (q, x, A) →*_M (p, ε, ε).
Step (2). Then we have that:
w ∈ L(G)
iff S → [q0 Z0 q] →*_G w for some q ∈ Q
iff (q0, w, Z0) →*_M (q, ε, ε) for some q ∈ Q
iff w ∈ N(M).
This concludes the proof. □


Theorem 3.1.14 holds also if acceptance is by final state and not by empty stack, 
because of Theorem 3.1.10. Thus, as a consequence of Theorems 3.1.14 and 3.1.10, 
we have the following fact. 


FACT 3.1.15. [Equivalence Between PDA’s and Context-Free Languages] 
Every context-free language can be accepted by a nondeterministic pda either by 
final state or by empty stack, and every nondeterministic pda accepts either by final 
state or by empty stack a context-free language. 


REMARK 3.1.16. [Acceptance by final state] If we change Figure 3.1.2 on
page 105 by considering Q = {q0, q1, q2}, F = {q2}, and the instruction (4′), in
place of the instruction (4), of the form:


(4′) q1 ε Z0 ↦ push Z0 goto q2


then the pda of Figure 3.1.2 accepts the context-free language generated by the
grammar G by final state (and not by empty stack). Note that in the instruction (4′),
instead of ‘push Z0’, we may also write: ‘push γ’ for any γ ∈ Γ*, because acceptance
depends on the state, not on the symbols in the stack.

Similarly, in the proof of Point (ii) of Theorem 3.1.14 on page 105, if the pda M
accepts by final state (not by empty stack) we need to add to the set P of productions
also the following ones, besides those of Points (ii.1) and (ii.2):


(ii.3) for each q ∈ F, Z ∈ Γ, q′ ∈ Q, [q Z q′] → ε

EXAMPLE 3.1.17. The nondeterministic pda which accepts by final state the
language {w w^R | w ∈ {0,1}*} is given by the following septuple:

({q0, q1, q2}, {0, 1}, {Z0, 0, 1}, q0, Z0, {q2}, δ)


where δ is defined as follows (we assume that the top of the stack is ‘to the left’,
and thus, for instance, if we push 0 Z0 onto the stack then the new top symbol is 0):


q0 0 Z0 ↦ push 0 Z0 goto q0
q0 1 Z0 ↦ push 1 Z0 goto q0
q0 0 0  ↦ push 0 0  goto q0   or   push ε goto q1
q0 0 1  ↦ push 0 1  goto q0
q0 1 1  ↦ push 1 1  goto q0   or   push ε goto q1
q0 1 0  ↦ push 1 0  goto q0
q1 0 0  ↦ push ε    goto q1
q1 1 1  ↦ push ε    goto q1
q0 ε Z0 ↦ push Z0   goto q2   (†)
q1 ε Z0 ↦ push Z0   goto q2


In the definition of δ, we have written the expression
q a Z ↦ push γ1 goto q1 or ... or push γn goto qn
to denote that

δ(q, a, Z) = {(q1, γ1), ..., (qn, γn)}.


The state q1 represents the state where the nondeterministic pda behaves as if the
middle of the input string has been already passed. The instruction (†) is for the
case where w w^R = ε.

The transition function δ can be represented in a pictorial way as indicated in
Figure 3.1.3 where an arc from state qi to state qj labeled by ‘x, y  w’
denotes the instruction: qi x y ↦ push w goto qj. Here x is the symbol read from
the input and y is the symbol on the top of the stack. We assume that after pushing
the string w onto the stack, the leftmost symbol of w becomes the new top of the
stack. An analogous notation will be introduced on page 209 for the transition
functions of (iterated) counter machines.
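
To experiment with this pda, one may explore its configurations by breadth-first search. The following Python sketch makes some assumptions of ours: the stack is a tuple with the top at index 0, "" stands for ε, and configurations whose stack is longer than a given bound are discarded (the default bound is enough for this pda, whose stack never holds more than one symbol per input symbol read plus Z0, but it would not be adequate for arbitrary pda’s). The function names are hypothetical.

```python
from collections import deque


def accepts_by_final_state(delta, q0, Z0, final, word, max_stack=None):
    """Breadth-first search over configurations (state, rest, stack)."""
    if max_stack is None:
        max_stack = len(word) + 2       # assumption: sufficient for this pda
    start = (q0, word, (Z0,))
    seen = {start}
    queue = deque([start])
    while queue:
        state, rest, stack = queue.popleft()
        if state in final and rest == "":
            return True          # final state reached, input completely read
        if not stack:
            continue             # no move is possible on an empty stack
        top, below = stack[0], stack[1:]
        options = [("", rest)]                   # epsilon moves
        if rest:
            options.append((rest[0], rest[1:]))  # moves reading one symbol
        for symbol, new_rest in options:
            for new_state, gamma in delta.get((state, symbol, top), set()):
                config = (new_state, new_rest, tuple(gamma) + below)
                if len(config[2]) <= max_stack and config not in seen:
                    seen.add(config)
                    queue.append(config)
    return False


# Transition function of Example 3.1.17 ("" stands for epsilon).
delta = {
    ("q0", "0", "Z0"): {("q0", ("0", "Z0"))},
    ("q0", "1", "Z0"): {("q0", ("1", "Z0"))},
    ("q0", "0", "0"): {("q0", ("0", "0")), ("q1", ())},
    ("q0", "0", "1"): {("q0", ("0", "1"))},
    ("q0", "1", "1"): {("q0", ("1", "1")), ("q1", ())},
    ("q0", "1", "0"): {("q0", ("1", "0"))},
    ("q1", "0", "0"): {("q1", ())},
    ("q1", "1", "1"): {("q1", ())},
    ("q0", "", "Z0"): {("q2", ("Z0",))},
    ("q1", "", "Z0"): {("q2", ("Z0",))},
}


def accepts(word):
    """Acceptance by final state with q2 as the only final state."""
    return accepts_by_final_state(delta, "q0", "Z0", {"q2"}, word)
```

For instance, `accepts("0110")` follows the computation that guesses the middle after reading 01, while `accepts("01")` fails because no guess leads to q2 with the input completely read.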


FIGURE 3.1.3. The transition function of a nondeterministic pda
which accepts by final state the language {w w^R | w ∈ {0,1}*}. x and y
stand for 0 or 1. Thus, for instance, the arc labeled by ‘x,y xy’ stands
for four arcs labeled by: (i) ‘0,0 00’, (ii) ‘0,1 01’, (iii) ‘1,0 10’, and
(iv) ‘1,1 11’, respectively.


The following example shows the constructions of the pda and the context-free
grammar we have indicated in the proof of Points (i) and (ii) of Theorem 3.1.14
above.


EXAMPLE 3.1.18. Let us consider the grammar G whose set of productions is
the singleton {S → ε}. The language it generates is the singleton {ε}, that is, the
language consisting of the empty word only. As indicated in Point (i) of the proof of
Theorem 3.1.14, the pda, call it M, which accepts by empty stack the language {ε}
has the following transition function δ (we assume that the top of the stack is ‘to
the left’, and thus, for instance, if we push S Z0 onto the stack then the new top
symbol is S):

q0 ε Z0 ↦ push S Z0 goto q1
q1 ε S  ↦ push ε    goto q1
q1 ε Z0 ↦ push ε    goto q1
Now, as indicated in the proof of Point (ii) of Theorem 3.1.14, the context-free
grammar which generates the language accepted by empty stack by the pda M, has
the following productions:
S → [q0 Z0 q0]
S → [q0 Z0 q1]
[q0 Z0 q0] → [q1 S q0] [q0 Z0 q0]
[q0 Z0 q0] → [q1 S q1] [q1 Z0 q0]
[q0 Z0 q1] → [q1 S q0] [q0 Z0 q1]
[q0 Z0 q1] → [q1 S q1] [q1 Z0 q1]
[q1 S q1] → ε
[q1 Z0 q1] → ε


By eliminating ε-productions, unit productions and useless symbols, we get, as
expected, the production S → ε only. □
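
The eight productions listed in this example can be generated mechanically from Points (ii.1) and (ii.2) of the proof of Theorem 3.1.14. The following Python sketch does so under an encoding of our own: a nonterminal [q Z q′] is the tuple (q, Z, q′), the axiom is the string "S", right-hand sides are tuples (the empty tuple stands for ε), and "" stands for ε in the transition function. The function name `pda_to_grammar` is hypothetical.

```python
from itertools import product


def pda_to_grammar(states, delta, q0="q0", Z0="Z0"):
    """Return the productions of Points (ii.1) and (ii.2) as (lhs, rhs) pairs."""
    states = sorted(states)
    productions = set()
    for q in states:                                          # (ii.1)
        productions.add(("S", ((q0, Z0, q),)))
    for (q, a, A), targets in delta.items():                  # (ii.2)
        prefix = (a,) if a else ()
        for q1, gamma in targets:
            m = len(gamma)
            # Guess the states q2, ..., q_{m+1} reached after each pop.
            for guess in product(states, repeat=m):
                chain = (q1,) + guess
                body = tuple((chain[i], gamma[i], chain[i + 1]) for i in range(m))
                productions.add(((q, A, chain[m]), prefix + body))
    return productions


# The pda M of Example 3.1.18 ("" stands for epsilon).
delta = {
    ("q0", "", "Z0"): {("q1", ("S", "Z0"))},
    ("q1", "", "S"): {("q1", ())},
    ("q1", "", "Z0"): {("q1", ())},
}
productions = pda_to_grammar({"q0", "q1"}, delta)
```

When the pushed string is empty (m = 0), the only guess is the empty one and the production obtained is [q A q1] → a, as stated in the proof.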


REMARK 3.1.19. If we assume that the grammar G = (V_T, V_N, P, S) is in Greibach
normal form (see Definition 3.7.1 on page 133), the pda M which accepts the
language L(G) by empty stack, can be constructed as follows: ({q0, q1}, V_T, V_T ∪ V_N ∪
{Z0}, q0, Z0, ∅, δ), where δ is given by the following instructions (we assume that
the top of the stack is ‘to the left’, and thus, for instance, if we push S Z0 onto the
stack then the new top symbol is S):


q0 ε Z0 ↦ push S Z0 goto q1
q1 a A  ↦ push γ    goto q1   for each production A → aγ
q1 ε S  ↦ push ε    goto q1   if the production S → ε is in P
q1 ε Z0 ↦ push ε    goto q1


EXAMPLE 3.1.20. Let us consider the grammar G with the axiom S and the following
productions in Greibach normal form:
S → a S B | c | ε
B → b
We now list the instructions which define the transition function δ of the pda
({q0, q1}, {a, b, c}, {a, b, c, S, B, Z0}, q0, Z0, ∅, δ)
which accepts L(G) by empty stack (we assume that the top of the stack is ‘to the
left’, and thus, for instance, if we push S Z0 onto the stack then the new top symbol
is S):


q0 ε Z0 ↦ push S Z0 goto q1
q1 a S  ↦ push S B  goto q1   (*)
q1 c S  ↦ push ε    goto q1
q1 ε S  ↦ push ε    goto q1   (*)
q1 b B  ↦ push ε    goto q1
q1 ε Z0 ↦ push ε    goto q1


Note that the instructions marked by (*) show that the pda is nondeterministic. 
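
The instructions of Remark 3.1.19 can also be produced by a program. The following Python sketch assumes, as our own encoding, that the productions are given as pairs (A, rhs) where rhs is a tuple that is either empty (for the production S → ε) or begins with a terminal symbol, and that "" stands for ε. The function name `gnf_to_pda` is hypothetical.

```python
def gnf_to_pda(productions, start="S"):
    """Transition function of the pda of Remark 3.1.19."""
    delta = {("q0", "", "Z0"): {("q1", (start, "Z0"))}}
    for A, rhs in productions:
        if rhs:
            # Production A -> a gamma: read a, replace A by gamma.
            key, gamma = ("q1", rhs[0], A), rhs[1:]
        else:
            # Production S -> epsilon: erase S without reading the input.
            key, gamma = ("q1", "", A), ()
        delta.setdefault(key, set()).add(("q1", tuple(gamma)))
    delta[("q1", "", "Z0")] = {("q1", ())}
    return delta


# Grammar of Example 3.1.20: S -> a S B | c | epsilon, B -> b.
delta = gnf_to_pda([
    ("S", ("a", "S", "B")),
    ("S", ("c",)),
    ("S", ()),
    ("B", ("b",)),
])
```

Unlike the pda of Figure 3.1.2, this pda never pushes terminal symbols onto the stack: the leading terminal of each production is consumed from the input in the same move that expands the nonterminal.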


The context-free languages are sometimes called nondeterministic context-free lan- 
guages to stress the fact that they are the languages accepted by nondeterministic 
pda’s. In the following Section 3.3 we will introduce: (i) the deterministic context- 
free languages which constitute a proper subclass of the context-free languages, 
and (ii) the deterministic pda’s which constitute a proper subclass of the nonde- 
terministic pda’s. Deterministic context-free languages and deterministic pushdown 
automata are equivalent in the sense that, as we will see below, the deterministic 
context-free languages are the languages accepted (by final state) by deterministic 
pda’s. 

Note that it is important that the input head of a pushdown automaton cannot 
move to the left. Indeed, if we do not keep this restriction the computational power 
of the pda’s increases as we now illustrate. 


DEFINITION 3.1.21. [Two-Way Nondeterministic Pushdown Automaton]
A two-way pda, or 2pda for short, is a pda where the input head is allowed to move
to the left and to the right, and there is a left endmarker and a right endmarker on
the input string.

The computational power of 2pda’s is increased with respect to the usual (one-way)
pda’s. Indeed, the language L = {0^n 1^n 2^n | n ≥ 1} which is a context-sensitive
language can be accepted by a 2pda as follows [9, page 121]. The accepting 2pda
checks that the left portion of the input string is of the form 0^n 1^n by reading the
input from left to right, and pushing the n 0’s on the stack and then popping one
0 for each symbol 1 occurring in the input string (by applying the technique shown
in Example 3.3.12 on page 120). Then, it moves to the left on the input string at
the beginning of the substring of 1’s (doing nothing on the stack). Finally, it checks
that the right portion of the input is of the form 1^n 2^n by pushing the n 1’s on the
stack and then popping one 1 for each symbol 2 occurring in the input string.

Note that the language L cannot be accepted by any pda because it is not 
a context-free language (see Corollary 3.11.2 on page 152) and pda’s can accept 
context-free languages only (see Theorem 3.1.14 on page 104). 
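
The strategy of the accepting 2pda can be imitated by a short imperative program that uses a single stack and an input head which is moved back once. The following Python sketch is not an encoding of a transition function: it only follows the two passes described above, it replaces the endmarkers by explicit bounds on the head position, and the function name `two_way_check` is hypothetical.

```python
def two_way_check(word):
    """Imitate the 2pda strategy for the language {0^n 1^n 2^n | n >= 1}."""
    stack, head = [], 0
    # Pass 1: push the 0's, then pop one 0 for each 1.
    while head < len(word) and word[head] == "0":
        stack.append("0")
        head += 1
    ones_start = head
    while head < len(word) and word[head] == "1":
        if not stack:
            return False         # more 1's than 0's
        stack.pop()
        head += 1
    if stack or head == ones_start:
        return False             # more 0's than 1's, or no 1's at all
    # Move the head back to the beginning of the 1's (stack untouched).
    head = ones_start
    # Pass 2: push the 1's, then pop one 1 for each 2.
    while head < len(word) and word[head] == "1":
        stack.append("1")
        head += 1
    while head < len(word) and word[head] == "2":
        if not stack:
            return False         # more 2's than 1's
        stack.pop()
        head += 1
    # Accept iff the stack is empty and the whole input has been read.
    return not stack and head == len(word)
```

The assignment `head = ones_start` is the only point where the head moves to the left, and it is exactly what a one-way pda cannot do.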


Before closing this section we would like to introduce the class LIN of the linear 
context-free languages and relate that class to a subclass of the pda’s [9, page 105]. 


DEFINITION 3.1.22. [Linear Context-Free Grammar] A context-free grammar
is said to be a linear context-free grammar iff the right hand side of each
production has at most one nonterminal symbol. A language generated by a linear
context-free grammar is said to be a linear context-free language (see also
Definition 7.6.7 on page 228). The class of linear context-free languages is called LIN. In
particular, we allow productions of the form A → ε, for some nonterminal symbol A.


Note that the language {a^n b^n | n ≥ 0} can be generated by the linear context-free
grammar with axiom S and the following productions:


S → a T
T → S b
S → ε
Since the language {a^n b^n | n ≥ 0} cannot be generated by a regular grammar, we
have that the class of languages generated by linear context-free grammars properly
includes the class of languages generated by regular grammars.
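
As a small check of the grammar with productions S → a T, T → S b, and S → ε, the word aabb (that is, the case n = 2) has the following derivation, in which every sentential form contains at most one nonterminal symbol:

```latex
S \Rightarrow aT \Rightarrow aSb \Rightarrow aaTb \Rightarrow aaSbb \Rightarrow aabb
```

The last step applies the production S → ε; the other steps alternate between S → a T and T → S b.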


DEFINITION 3.1.23. [Single-turn Nondeterministic PDA] A nondeterministic
pda is said to be single-turn iff for all configurations (q0, α0, Z0), (q1, α1, γ1),
(q2, α2, γ2), and (q3, α3, γ3), we have that if (q0, α0, Z0) →* (q1, α1, γ1) →* (q2, α2, γ2)
→* (q3, α3, γ3) and |γ1| > |γ2| then |γ2| ≥ |γ3| (that is, when the content of the stack
starts decreasing in length, then it never increases again).
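
The single-turn condition concerns only the lengths of the stack contents along a computation. The following Python sketch checks it on the list of stack lengths of one given computation; the function name `is_single_turn` and the representation of a computation as a list of integers are assumptions of ours. A pda is single-turn iff the check succeeds for all its computations, which this function does not attempt to enumerate.

```python
def is_single_turn(heights):
    """Check the single-turn condition on the stack lengths of one computation.

    `heights` lists the length of the stack content for the successive
    configurations of the computation.
    """
    decreasing = False
    for previous, current in zip(heights, heights[1:]):
        if current < previous:
            decreasing = True          # the stack has started to shrink
        elif current > previous and decreasing:
            return False               # it grows again after shrinking
    return True
```

For instance, a computation whose stack lengths are 1, 2, 3, 2, 1, 1 passes the check, while one whose stack lengths are 1, 2, 1, 2 does not.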


THEOREM 3.1.24. [Equivalence Between Linear Context-Free Languages
and Single-Turn Nondeterministic PDA’s] A language is a linear context-free
language iff it is accepted by empty stack by a single-turn nondeterministic pda iff
it is accepted by final state by a single-turn nondeterministic pda [9, page 143].


In Section 6.4 starting on page 205, we will mention some undecidability results
for the class of linear context-free languages.


3.2. From PDA’s to Context-Free Grammars and Back: Some Examples


In this section we present some examples in which we show how one can construct: 
(i) given a context-free grammar, a pushdown automaton which is equivalent to that 
grammar, and 
(ii) given a pushdown automaton, a context-free grammar which is equivalent to 
that pushdown automaton. 

We will consider both the case of acceptance by final state and the case of 
acceptance by empty stack. 

For the reader’s convenience we recall here some assumptions that we make on
any given pda M = (Q, Σ, Γ, q0, Z0, F, δ):
(i) initially, only the symbol Z0 is on the stack,
(ii) acceptance either by final state or by empty stack can occur only if the input
is completely read, that is, the remaining part of the input string to be read (see
Remark 3.1.9 on page 103) is the empty string ε, and
(iii) if the stack is empty then no move is possible.


Recall also that we assume that when a pda makes a move and replaces the top
symbol of the stack, say A, by a string α, then the leftmost symbol of the string α
is the new top of the stack.


EXAMPLE 3.2.1. [From Context-Free Grammars to PDA’s Which Accept
by final state or by empty stack] Given the context-free grammar G with axiom S
and the following productions:

S → a S b | c | ε
we want to construct a pushdown automaton which accepts by final state the
language generated by G, that is, {a^n c b^n | n ≥ 0} ∪ {a^n b^n | n ≥ 0}. We use the technique
indicated in the proof of Theorem 3.1.14 on page 104 and Remark 3.1.16 on page 106.
We construct a pda with three states: q0, q1, and q2. The state q0 is the initial state
and the set of final states is the singleton {q2}. The transition function δ of the
pda is as follows (we assume that the top of the stack is ‘to the left’, and thus, for
instance, when we push the string S Z0 onto the stack, we assume that the new top
symbol of the stack is S):


δ(q0, ε, Z0) = {(q1, S Z0)}
δ(q1, ε, S)  = {(q1, a S b), (q1, c), (q1, ε)}
δ(q1, a, a)  = {(q1, ε)}
δ(q1, b, b)  = {(q1, ε)}
δ(q1, c, c)  = {(q1, ε)}
δ(q1, ε, Z0) = {(q2, Z0)}


Note that in this last defining equation for δ it is not important whether or not we
push Z0 or any other string onto the stack.

Instead of a pda with three states, we may use a pda with two states, called q0
and q1, as we now indicate. We assume that acceptance is by final state and the
only final state is q1. The transition function δ for this pda with two states is the
following one, where $ is a new stack symbol:

δ(q0, ε, Z0) = {(q0, S $)}
δ(q0, ε, S)  = {(q0, a S b), (q0, c), (q0, ε)}
δ(q0, a, a)  = {(q0, ε)}
δ(q0, b, b)  = {(q0, ε)}
δ(q0, c, c)  = {(q0, ε)}
δ(q0, ε, $)  = {(q1, Z0)}


We leave it to the reader to show that this definition of δ is correct. If we replaced $
by Z0, the definition of δ would not be correct and the pda would accept by final
state words which are not in the language generated by the grammar G. One such
word is abab.

Note that in the last defining equation for δ it is not important whether or not we
push Z0 or any other string onto the stack. Note also that the transition function δ
is not defined when the pda is in state q1.
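
The role of the symbol $ can be checked experimentally. The following Python sketch builds the transition function of the two-state pda with a parameter for the symbol which plays the role of $, and explores the configurations by breadth-first search. The encoding (stack as a tuple with the top at index 0, "" for ε), the bound on the stack length, and the function names are assumptions of ours; the bound is adequate for these two pda’s but not for arbitrary pda’s.

```python
from collections import deque


def accepts(delta, final, word, q0="q0", Z0="Z0"):
    """Acceptance by final state, by breadth-first search over configurations."""
    max_stack = len(word) + 4    # assumption: longer stacks are never needed here
    start = (q0, word, (Z0,))
    seen, queue = {start}, deque([start])
    while queue:
        state, rest, stack = queue.popleft()
        if state in final and rest == "":
            return True          # final state reached, input completely read
        if not stack:
            continue             # no move is possible on an empty stack
        top, below = stack[0], stack[1:]
        options = [("", rest)] + ([(rest[0], rest[1:])] if rest else [])
        for symbol, new_rest in options:
            for new_state, gamma in delta.get((state, symbol, top), set()):
                config = (new_state, new_rest, tuple(gamma) + below)
                if len(config[2]) <= max_stack and config not in seen:
                    seen.add(config)
                    queue.append(config)
    return False


def two_state_pda(bottom):
    """The transition function above, with `bottom` in the role of the symbol $."""
    delta = {
        ("q0", "", "S"): {("q0", ("a", "S", "b")), ("q0", ("c",)), ("q0", ())},
        ("q0", "a", "a"): {("q0", ())},
        ("q0", "b", "b"): {("q0", ())},
        ("q0", "c", "c"): {("q0", ())},
    }
    delta.setdefault(("q0", "", "Z0"), set()).add(("q0", ("S", bottom)))
    delta.setdefault(("q0", "", bottom), set()).add(("q1", ("Z0",)))
    return delta


correct = two_state_pda("$")
faulty = two_state_pda("Z0")     # the symbol $ replaced by Z0
```

With the faulty transition function, after the stack has shrunk to Z0 the pda may push S again and start a second derivation, which is how the word abab gets accepted; with $ at the bottom this restart is impossible.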

If we use acceptance by empty stack, this last pda may be simplified and reduced
to a pda with one state only, as follows:


δ(q0, ε, Z0) = {(q0, S)}
δ(q0, ε, S)  = {(q0, a S b), (q0, c), (q0, ε)}
δ(q0, a, a)  = {(q0, ε)}
δ(q0, b, b)  = {(q0, ε)}
δ(q0, c, c)  = {(q0, ε)}


Again we leave it to the reader to show that this definition of δ is correct. Note that in
the first move the symbol S replaces Z0 at the bottom position of the stack.


EXAMPLE 3.2.2. [From PDA’s Which Accept by empty stack to Context-Free
Grammars] Let us consider the pda with one state described at the end of
the previous Example 3.2.1. It accepts by empty stack the language generated by
the grammar $G$ whose productions are:

\[ S \to a\,S\,b \mid c \mid \varepsilon \]

The context-free grammar corresponding to that pda, constructed as indicated in the
proof of Theorem 3.1.14 on page 104, has the following productions:

\[
\begin{array}{rcl}
S & \to & [q_0\,Z_0\,q_0] \\
{}[q_0\,Z_0\,q_0] & \to & [q_0\,S\,q_0] \\
{}[q_0\,S\,q_0] & \to & [q_0\,a\,q_0]\,[q_0\,S\,q_0]\,[q_0\,b\,q_0] \\
{}[q_0\,S\,q_0] & \to & [q_0\,c\,q_0] \\
{}[q_0\,S\,q_0] & \to & \varepsilon \\
{}[q_0\,a\,q_0] & \to & a \\
{}[q_0\,b\,q_0] & \to & b \\
{}[q_0\,c\,q_0] & \to & c
\end{array}
\]


By suitable renaming of the nonterminal symbols we get:

\[
\begin{array}{rcl}
S & \to & R \\
R & \to & T \\
T & \to & A\,T\,B \mid C \mid \varepsilon \\
A & \to & a \\
B & \to & b \\
C & \to & c
\end{array}
\]


and, by unfolding the nonterminal symbols $A$, $B$, $C$, $R$, and $T$ on the right hand
sides, and by eliminating useless symbols and their productions, we get:

\[
\begin{array}{rcl}
S & \to & a\,T\,b \mid c \mid \varepsilon \\
T & \to & a\,T\,b \mid c \mid \varepsilon
\end{array}
\]

Since $S$ generates the same language as $T$ (and this can be proved by induction
on the length of the derivation of a word of the language), we can eliminate the
productions for $T$ and replace $T$ by $S$ in the productions for $S$. By doing so, we get:

\[ S \to a\,S\,b \mid c \mid \varepsilon \]

As one might have expected, these productions are those of the grammar $G$.
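For a pda with a single state, the triple construction used in this example can be sketched in a few lines of Python. This is a hedged illustration of the standard construction specialised to one state, which appears to be what the proof of Theorem 3.1.14 yields here. The function name `grammar_from_one_state_pda`, the dictionary encoding of $\delta$, and the textual form of the nonterminals are assumptions made for this sketch.

```python
# delta of the one-state pda: (read symbol, top of stack) -> pushed strings,
# each given as a tuple of stack symbols with the new top first;
# the read symbol '' denotes an epsilon-move.
ONE_STATE_PDA = {
    ('', 'Z0'): [('S',)],
    ('', 'S'): [('a', 'S', 'b'), ('c',), ()],
    ('a', 'a'): [()],
    ('b', 'b'): [()],
    ('c', 'c'): [()],
}

def grammar_from_one_state_pda(delta, state='q0', bottom='Z0'):
    """Return the productions as pairs (left hand side, right hand side)."""
    def triple(symbol):
        return f'[{state} {symbol} {state}]'

    productions = [('S', [triple(bottom)])]
    for (read, top), pushes in delta.items():
        for pushed in pushes:
            # [q Z q] -> x [q Y1 q] ... [q Yk q]; an empty list stands for epsilon
            rhs = ([read] if read else []) + [triple(y) for y in pushed]
            productions.append((triple(top), rhs))
    return productions

for lhs, rhs in grammar_from_one_state_pda(ONE_STATE_PDA):
    print(lhs, '->', ' '.join(rhs) if rhs else 'epsilon')
```

With more than one state, every choice of intermediate states has to be enumerated, as the next example shows.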


EXAMPLE 3.2.3. [From PDA’s Which Accept by final state to Context-Free
Grammars] Let us consider the following pda $M$ with three states: $q_0$, $q_1$,
and $q_2$. The state $q_0$ is the initial state and the set of final states is the singleton
$\{q_2\}$. The input alphabet $V_T$ is $\{a, b, c\}$. The transition function $\delta$ of the pda is as
follows (we assume that the top of the stack is ‘to the left’, and thus, for instance,
when we push the string $S\,Z_0$ onto the stack, we assume that the new top symbol of
the stack is $S$):

\[
\begin{array}{rcl}
\delta(q_0, \varepsilon, Z_0) & = & \{(q_1, S\,Z_0)\} \\
\delta(q_1, \varepsilon, S) & = & \{(q_1, a\,S\,b),\ (q_1, c),\ (q_1, \varepsilon)\} \\
\delta(q_1, a, a) & = & \{(q_1, \varepsilon)\} \\
\delta(q_1, b, b) & = & \{(q_1, \varepsilon)\} \\
\delta(q_1, c, c) & = & \{(q_1, \varepsilon)\} \\
\delta(q_1, \varepsilon, Z_0) & = & \{(q_2, Z_0)\}
\end{array}
\]

As we have seen in Example 3.2.1 on page 111, the pda M accepts by final state all 
words which are generated by the context-free grammar with axiom S and whose 
productions are: 


\[ S \to a\,S\,b \mid c \mid \varepsilon \]


We can construct a context-free grammar, call it G, which generates the same lan- 
guage accepted by final state by the pda M by applying the techniques indicated 
in the proof of Point (ii) of Theorem 3.1.14 on page 105 and in Remark 3.1.16 on 
page 106. 

The nonterminal symbols of the context-free grammar $G$ are: $S$, which is the axiom
of $G$, and the 45 symbols which are of the form $[q\ s\ q']$, for any $q, q' \in \{q_0, q_1, q_2\}$
and $s \in \{S, Z_0, a, b, c\}$, that is,

\[
\begin{array}{l}
{}[q_0\,S\,q_0],\ [q_0\,S\,q_1],\ [q_0\,S\,q_2],\ [q_0\,Z_0\,q_0],\ [q_0\,Z_0\,q_1],\ [q_0\,Z_0\,q_2],\ \ldots,\ [q_0\,c\,q_2], \\
{}[q_1\,S\,q_0],\ \ldots,\ [q_1\,c\,q_2], \\
{}[q_2\,S\,q_0],\ \ldots,\ [q_2\,c\,q_2].
\end{array}
\]

We will collectively indicate those 45 symbols by the matrix

\[
\left[\;
\begin{matrix} q_0 \\ q_1 \\ q_2 \end{matrix}
\;\;
\begin{matrix} S \\ Z_0 \\ a \\ b \\ c \end{matrix}
\;\;
\begin{matrix} q_0 \\ q_1 \\ q_2 \end{matrix}
\;\right]
\]

and in that matrix every path from left to right denotes a nonterminal symbol of the
grammar $G$. (Obviously, in that matrix there are $3 \times 5 \times 3 = 45$ paths, one for every
possible choice of the first, second, and third component.) In what follows we will
use that matrix notation also for denoting the productions of the grammar $G$, as we
will indicate.

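The productions of the grammar $G$ are the following ones:

\[
1.\quad S \to \left[\; q_0 \;\; Z_0 \;\; \begin{matrix} q_0 \\ q_1 \\ q_2 \end{matrix} \;\right]
\]

which in our matrix notation, by considering every path from left to right, denotes
the three productions:

\[
\begin{array}{ll}
1.1 & S \to [q_0\,Z_0\,q_0] \\
1.2 & S \to [q_0\,Z_0\,q_1] \\
1.3 & S \to [q_0\,Z_0\,q_2].
\end{array}
\]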


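Then we have the following production:

\[
2.\quad
\left[\; q_0 \;\; Z_0 \;\; \underset{(\alpha)}{\begin{matrix} q_0 \\ q_1 \\ q_2 \end{matrix}} \;\right]
\to \varepsilon\,
\left[\; q_1 \;\; S \;\; \underset{(\beta)}{\begin{matrix} q_0 \\ q_1 \\ q_2 \end{matrix}} \;\right]
\left[\; \underset{(\beta)}{\begin{matrix} q_0 \\ q_1 \\ q_2 \end{matrix}} \;\; Z_0 \;\; \underset{(\alpha)}{\begin{matrix} q_0 \\ q_1 \\ q_2 \end{matrix}} \;\right]
\]

This production 2 denotes the following nine productions 2.1-2.9, where in our matrix
notation the choices marked by the same Greek letter should be the same:

\[
\begin{array}{ll}
2.1 & [q_0\,Z_0\,q_0] \to [q_1\,S\,q_0]\,[q_0\,Z_0\,q_0] \\
2.2 & [q_0\,Z_0\,q_0] \to [q_1\,S\,q_1]\,[q_1\,Z_0\,q_0] \\
2.3 & [q_0\,Z_0\,q_0] \to [q_1\,S\,q_2]\,[q_2\,Z_0\,q_0] \\
\vdots & \\
2.9 & [q_0\,Z_0\,q_2] \to [q_1\,S\,q_2]\,[q_2\,Z_0\,q_2]
\end{array}
\]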


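We also have the following productions (again here and in what follows the choices
marked by the same Greek letter should be the same):

\[
3.1\quad
\left[\; q_1 \;\; S \;\; \underset{(\alpha)}{\begin{matrix} q_0 \\ q_1 \\ q_2 \end{matrix}} \;\right]
\to \varepsilon\,
\left[\; q_1 \;\; a \;\; \underset{(\beta)}{\begin{matrix} q_0 \\ q_1 \\ q_2 \end{matrix}} \;\right]
\left[\; \underset{(\beta)}{\begin{matrix} q_0 \\ q_1 \\ q_2 \end{matrix}} \;\; S \;\; \underset{(\gamma)}{\begin{matrix} q_0 \\ q_1 \\ q_2 \end{matrix}} \;\right]
\left[\; \underset{(\gamma)}{\begin{matrix} q_0 \\ q_1 \\ q_2 \end{matrix}} \;\; b \;\; \underset{(\alpha)}{\begin{matrix} q_0 \\ q_1 \\ q_2 \end{matrix}} \;\right]
\qquad \text{(27 productions)}
\]

\[
3.2\quad
\left[\; q_1 \;\; S \;\; \underset{(\alpha)}{\begin{matrix} q_0 \\ q_1 \\ q_2 \end{matrix}} \;\right]
\to \varepsilon\,
\left[\; q_1 \;\; c \;\; \underset{(\alpha)}{\begin{matrix} q_0 \\ q_1 \\ q_2 \end{matrix}} \;\right]
\qquad \text{(3 productions)}
\]

\[
\begin{array}{ll}
3.3 & [q_1\,S\,q_1] \to \varepsilon \\
4. & [q_1\,a\,q_1] \to a \\
5. & [q_1\,b\,q_1] \to b \\
6. & [q_1\,c\,q_1] \to c
\end{array}
\]

\[
7.\quad
\left[\; q_1 \;\; Z_0 \;\; \underset{(\alpha)}{\begin{matrix} q_0 \\ q_1 \\ q_2 \end{matrix}} \;\right]
\to \varepsilon\,
\left[\; q_2 \;\; Z_0 \;\; \underset{(\alpha)}{\begin{matrix} q_0 \\ q_1 \\ q_2 \end{matrix}} \;\right]
\qquad \text{(3 productions)}
\]

\[
8.\quad
\left[\; q_2 \;\; \begin{matrix} S \\ Z_0 \\ a \\ b \\ c \end{matrix} \;\; \begin{matrix} q_0 \\ q_1 \\ q_2 \end{matrix} \;\right]
\to \varepsilon
\qquad \text{(15 productions)}
\]

These last fifteen productions 8 are required according to Remark 3.1.16 on page 106
because the acceptance of the given pda is by final state.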

Now we will check that, indeed, all the productions 1-8 generate all words which 
are generated from the axiom S by the productions: 


\[ S \to a\,S\,b \mid c \mid \varepsilon \]


First note that in the productions 3.1 only the choice $q_1$ can produce words in
$\{a, b, c\}^*$ (see, in particular, the productions 3.3, 4, 5, and 6). This fact can be
derived by applying the From-Below Procedure which we will present later (see
Algorithm 3.5.1 on page 123). Thus, we can replace the productions 3.1 by the
following one:

\[
3.1'\quad [q_1\,S\,q_1] \to [q_1\,a\,q_1]\,[q_1\,S\,q_1]\,[q_1\,b\,q_1]
\]

Analogously, in the productions 3.2 only the choice $q_1$ can produce words in $\{a, b, c\}^*$
(see the production 6). Thus, we can replace the productions 3.2 by the following
one:

\[
3.2'\quad [q_1\,S\,q_1] \to [q_1\,c\,q_1]
\]


In the productions 2, the only possible choice for the position $(\beta)$ is $q_1$, because
$[q_1\,S\,q_0]$ and $[q_1\,S\,q_2]$ cannot produce words in $\{a, b, c\}^*$ (recall that we have already
shown that the productions 3.1 can be replaced by the production 3.1'). Thus, we
can replace the productions 2 by the following three productions:

\[
2'.\quad
\left[\; q_0 \;\; Z_0 \;\; \underset{(\alpha)}{\begin{matrix} q_0 \\ q_1 \\ q_2 \end{matrix}} \;\right]
\to [q_1\,S\,q_1]
\left[\; q_1 \;\; Z_0 \;\; \underset{(\alpha)}{\begin{matrix} q_0 \\ q_1 \\ q_2 \end{matrix}} \;\right]
\]


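By unfolding the productions 7 with respect to the nonterminals

\[
\left[\; q_2 \;\; Z_0 \;\; \begin{matrix} q_0 \\ q_1 \\ q_2 \end{matrix} \;\right]
\]

(see productions 8), we get:

\[
7'.\quad
\left[\; q_1 \;\; Z_0 \;\; \begin{matrix} q_0 \\ q_1 \\ q_2 \end{matrix} \;\right]
\to \varepsilon
\qquad \text{(3 productions)}
\]

By unfolding the productions 2' with respect to the nonterminals

\[
\left[\; q_1 \;\; Z_0 \;\; \begin{matrix} q_0 \\ q_1 \\ q_2 \end{matrix} \;\right]
\]

(see productions 7'), we get the following three productions:

\[
2''.\quad
\left[\; q_0 \;\; Z_0 \;\; \begin{matrix} q_0 \\ q_1 \\ q_2 \end{matrix} \;\right]
\to [q_1\,S\,q_1]
\]

By unfolding the productions 1 with respect to the nonterminals

\[
\left[\; q_0 \;\; Z_0 \;\; \begin{matrix} q_0 \\ q_1 \\ q_2 \end{matrix} \;\right]
\]

(see productions 2''), we get the following production:

\[
1'.\quad S \to [q_1\,S\,q_1]
\]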

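At this point we have that the productions of the grammar $G$ with axiom $S$ are the
following ones:

\[
\begin{array}{ll}
1'. & S \to [q_1\,S\,q_1] \\
3.1' & [q_1\,S\,q_1] \to [q_1\,a\,q_1]\,[q_1\,S\,q_1]\,[q_1\,b\,q_1] \\
3.2' & [q_1\,S\,q_1] \to [q_1\,c\,q_1] \\
3.3 & [q_1\,S\,q_1] \to \varepsilon \\
4. & [q_1\,a\,q_1] \to a \\
5. & [q_1\,b\,q_1] \to b \\
6. & [q_1\,c\,q_1] \to c
\end{array}
\]

\[
7'.\quad
\left[\; q_1 \;\; Z_0 \;\; \begin{matrix} q_0 \\ q_1 \\ q_2 \end{matrix} \;\right]
\to \varepsilon
\qquad \text{(3 productions)}
\]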


Now the productions 7' can be eliminated because they cannot be used in any
derivation from the axiom $S$. This fact can be obtained by applying the From-Above
Procedure which we will present later (see Algorithm 3.5.3 on page 124). By
unfolding: (i) the production 1' with respect to $[q_1\,S\,q_1]$, (ii) the production 3.1' with
respect to $[q_1\,a\,q_1]$ and $[q_1\,b\,q_1]$, and (iii) the production 3.2' with respect to $[q_1\,c\,q_1]$,
we get the following productions:

\[
\begin{array}{rcl}
S & \to & a\,[q_1\,S\,q_1]\,b \mid c \mid \varepsilon \\
{}[q_1\,S\,q_1] & \to & a\,[q_1\,S\,q_1]\,b \mid c \mid \varepsilon
\end{array}
\]

Since $S$ generates the same language as $[q_1\,S\,q_1]$ (and this can be proved by induction
on the length of the derivation of a word of the language), we can eliminate the
productions for $[q_1\,S\,q_1]$ and replace $[q_1\,S\,q_1]$ by $S$ in the productions for $S$. By
doing so, we get, as expected, the following three productions:

\[ S \to a\,S\,b \mid c \mid \varepsilon \]


3.3. Deterministic PDA’s and Deterministic Context-Free Languages 


Let us introduce the notion of deterministic pushdown automaton and deterministic 
context-free language. 


DEFINITION 3.3.1. [Deterministic Pushdown Automaton] A pushdown automaton
$(Q, \Sigma, \Gamma, q_0, Z_0, F, \delta)$ is said to be a deterministic pushdown automaton (or
a dpda, for short) iff
(i) $\forall q \in Q$, $\forall Z \in \Gamma$, if $\delta(q, \varepsilon, Z) \neq \{\}$ then $\forall a \in \Sigma$, $\delta(q, a, Z) = \{\}$ (that is, no
other moves are allowed when an $\varepsilon$-move is allowed), and
(ii) $\forall q \in Q$, $\forall Z \in \Gamma$, $\forall x \in \Sigma \cup \{\varepsilon\}$, $\delta(q, x, Z)$ is either $\{\}$ or a singleton (that is,
if a move is allowed, then that move can be made in one way only, that is, there
exists only one next configuration for the dpda).


Thus, a deterministic pda has a transition function $\delta$ which, for each state and each
symbol on the top of the stack: (i) for each input element in $\Sigma \cup \{\varepsilon\}$, returns either
a singleton or the empty set, and (ii) returns a non-empty set for the input $\varepsilon$ only if
it returns the empty set for all the symbols in $\Sigma$.
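The two conditions of Definition 3.3.1 can be checked mechanically on a finite transition table. The Python sketch below is an illustration under stated assumptions: the name `is_deterministic`, the dictionary encoding of $\delta$ (keys are triples of state, input symbol or `''` for $\varepsilon$, and top of stack), and the two small tables `DETERMINISTIC_EXAMPLE` and `VIOLATING_EXAMPLE` are hypothetical and do not come from the text.

```python
def is_deterministic(delta, input_alphabet):
    """Check Conditions (i) and (ii) of the definition of a dpda."""
    for (state, symbol, top), moves in delta.items():
        if len(moves) > 1:
            return False          # Condition (ii): at most one move
        if symbol == '' and moves:
            # Condition (i): an epsilon-move excludes every reading move
            if any(delta.get((state, a, top)) for a in input_alphabet):
                return False
    return True

# The one-state pda of Example 3.2.1: three moves are possible on S.
ONE_STATE_PDA = {
    ('q0', '', 'Z0'): {('q0', 'S')},
    ('q0', '', 'S'): {('q0', 'aSb'), ('q0', 'c'), ('q0', '')},
    ('q0', 'a', 'a'): {('q0', '')},
    ('q0', 'b', 'b'): {('q0', '')},
    ('q0', 'c', 'c'): {('q0', '')},
}

# A hypothetical table satisfying both conditions.
DETERMINISTIC_EXAMPLE = {
    ('p', 'a', 'Z'): {('p', 'AZ')},
    ('p', 'a', 'A'): {('p', 'AA')},
    ('p', 'b', 'A'): {('r', '')},
    ('r', 'b', 'A'): {('r', '')},
    ('r', '', 'Z'): {('f', 'Z')},
}

# The same table with a reading move added where an epsilon-move exists.
VIOLATING_EXAMPLE = dict(DETERMINISTIC_EXAMPLE)
VIOLATING_EXAMPLE[('r', 'a', 'Z')] = {('p', 'AZ')}
```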

In what follows, when referring to dpda’s we will feel free to write ‘DPDA’, 
instead of ‘dpda’. 


DEFINITION 3.3.2. [Language Accepted by a DPDA by final state] The
language accepted by a deterministic pushdown automaton $M = (Q, \Sigma, \Gamma, q_0, Z_0, F, \delta)$
by final state is the following set $L$ of words:

\[
L = \{w \mid \text{there exists a configuration } C \in \mathit{Fin}^{f}_{M} \text{ such that } (q_0, w, Z_0) \to^{*}_{M} C\}.
\]


DEFINITION 3.3.3. [Deterministic Context-Free Language] A context-free 
language is said to be a deterministic context-free language iff it is accepted by a 
deterministic pushdown automaton by final state. 


DEFINITION 3.3.4. [Language Accepted by a DPDA by empty stack] The
language accepted by a deterministic pushdown automaton $M = (Q, \Sigma, \Gamma, q_0, Z_0, F, \delta)$
by empty stack is the following set $L$ of words:

\[
L = \{w \mid \text{there exists a configuration } C \in \mathit{Fin}^{e}_{M} \text{ such that } (q_0, w, Z_0) \to^{*}_{M} C\}.
\]

Note that when introducing the concepts of the above Definitions 3.3.2 and 3.3.4,
other textbooks use the terms ‘recognizes’ and ‘recognized’, instead of the terms
‘accepts’ and ‘accepted’, respectively.


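EXAMPLE 3.3.5. Let $w^R$ denote the string obtained from the string $w$ by reversing
the order of the symbols. A deterministic pda accepting by final state the
language $L = \{w\,c\,w^R \mid w \in \{0,1\}^*\}$ is the septuple:

\[
(\{q_0, q_1, q_2\},\ \{0, 1, c\},\ \{Z_0, 0, 1\},\ q_0,\ Z_0,\ \{q_2\},\ \delta)
\]

where the function $\delta$ is defined as follows (here we assume that the top of the stack
is ‘to the left’, that is, when, for instance, we push $0\,Z$ on the stack then the new
top is 0):

for any $Z \in \{Z_0, 0, 1\}$,

\[
\begin{array}{llll}
q_0 & 0 & Z & \longrightarrow\ \text{push } 0\,Z \quad \text{goto } q_0 \\
q_0 & 1 & Z & \longrightarrow\ \text{push } 1\,Z \quad \text{goto } q_0 \\
q_0 & c & Z & \longrightarrow\ \text{push } Z \quad \text{goto } q_1 \\
q_1 & 0 & 0 & \longrightarrow\ \text{push } \varepsilon \quad \text{goto } q_1 \\
q_1 & 1 & 1 & \longrightarrow\ \text{push } \varepsilon \quad \text{goto } q_1 \\
q_1 & \varepsilon & Z_0 & \longrightarrow\ \text{push } Z_0 \quad \text{goto } q_2
\end{array}
\]

Recall that acceptance by final state requires that: (i) the state $q_2$ is final, and
(ii) the input string has been completely read. We do not care about the symbols
occurring in the stack. □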
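The six instructions above translate almost literally into a deterministic loop over the input. The following Python sketch is illustrative only: the function name `accepts_wcw_reversed` and the encoding of the stack as a list whose last element is the top are assumptions made here.

```python
def accepts_wcw_reversed(word):
    """Run the dpda for { w c w^R | w in {0,1}* } with acceptance by final state."""
    state, stack = 'q0', ['Z0']            # the top of the stack is stack[-1]
    for symbol in word:
        top = stack[-1]
        if state == 'q0' and symbol in ('0', '1'):
            stack.append(symbol)           # q0 0 Z -> push 0 Z, q0 1 Z -> push 1 Z
        elif state == 'q0' and symbol == 'c':
            state = 'q1'                   # q0 c Z -> push Z, goto q1
        elif state == 'q1' and symbol in ('0', '1') and top == symbol:
            stack.pop()                    # q1 0 0 and q1 1 1 -> push epsilon
        else:
            return False                   # no instruction reads this symbol
    if state == 'q1' and stack[-1] == 'Z0':
        state = 'q2'                       # epsilon-move: q1 eps Z0 -> goto q2
    return state == 'q2'                   # final state, input completely read
```

When the top of the stack is $Z_0$ in state $q_1$ and some input is still unread, the only possible move of the dpda is the $\varepsilon$-move to $q_2$, after which it is stuck; the sketch rejects directly in that case.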


There are context-free languages which are nondeterministic in the sense that they 
are accepted by nondeterministic pda’s, but they cannot be accepted by determin- 
istic pda’s. 
The language $L = \{w\,w^R \mid w \in \{0,1\}^*\}$ of Example 3.1.17 on page 107 is a
context-free language which is not a deterministic context-free language [9, page 265].
Also the language $L = \{a^n b^n \mid n \geq 1\} \cup \{a^n b^{2n} \mid n \geq 1\}$ is a context-free language
which is not a deterministic context-free language ([3, page 717] and [9, page 265]).


FACT 3.3.6. [Restricted DPDA’s Which Accept by final state] For any
deterministic pda which accepts by final state, there exists an equivalent deterministic
pda which accepts by final state, such that at each move:

- either (1.1) it reads one symbol of the input, or (1.2) it makes an $\varepsilon$-move on the
input, and

- either (2.1) it pops one symbol off the stack, or (2.2) it pushes one symbol on the
stack, or (2.3) it does not change the symbol on the top of the stack, and

- if it makes an $\varepsilon$-move on the input then in that move it pops one symbol off the
stack [9, pages 234 and 264].


FACT 3.3.7. [DPDA’s Which Accept by final state Are More Powerful
Than DPDA’s Which Accept by empty stack] (i) For any deterministic pda $M$
which accepts a language $L$ by empty stack there exists an equivalent deterministic
pda $M_1$ which accepts $L$ by final state, and (ii) for a deterministic pda $M_1$ which
accepts a language $L$ by final state there may not exist an equivalent deterministic pda
$M$ which accepts $L$ by empty stack.

PROOF. (i) The proof of this point is like that of Theorem 3.1.10 on page 103.
(ii) Let us consider the language

\[
E = \{w \mid w \in \{0,1\}^* \text{ and the number of occurrences of 0's in } w \text{ is equal to the number of occurrences of 1's in } w\}.
\]

This language is accepted by a deterministic iterated counter machine (see
Section 7.1 starting on page 207) with acceptance by final state (see Figure 7.3.3 on
page 221) and thus, it is accepted by a deterministic pushdown automaton by final
state. The language $E$ cannot be accepted by a deterministic pushdown automaton
by empty stack. Indeed, let us assume, on the contrary, that there exists one such
automaton. Call it $M$. The automaton $M$ should accept the words 01 and 0101, but
it should not accept the word 010. This means that the automaton $M$ should have
its stack empty after reading the input strings 01 and 0101, but its stack should not
be empty after reading the input string 010. This is impossible: since $M$ is deterministic,
on the input 0101 it first behaves as on the input 01 and thus it empties its stack
after reading the prefix 01, and when the stack is empty, $M$ cannot make any move.


Thus, (i) for nondeterministic pda’s the notions of acceptance by final state and
by empty stack are equivalent (see Theorem 3.1.10 on page 103), while (ii) for
deterministic pda’s the notion of acceptance by final state is more powerful than that
of acceptance by empty stack.

Below we will see that, if we assume that the input string is terminated by a
right endmarker, say $\$$, with $\$$ not in $\Sigma$, then deterministic pda’s with acceptance
by final state are equivalent to deterministic pda’s with acceptance by empty stack.


THEOREM 3.3.8. For any deterministic pda $M$ which accepts by final state a
language $L$ (which, by definition, is a deterministic context-free language), there
exists an equivalent deterministic pda $M_1$ which accepts by final state the language
$L$ and for each word $w \in L$, $M_1$ reads the whole input word $w$ (in this case, if $w \neq \varepsilon$
then the rightmost symbol of $w$ is an element of $\Sigma$, not the special symbol $\$$). After
performing the complete reading of the input word $w$ (which is always the case if
$w = \varepsilon$), if $M_1$ is in a final state (that is, $M_1$ accepts $w$) then $M_1$ does not make any
$\varepsilon$-move on the input $w$. Thus, we can construct $M_1$ so that, if a string $w = a_1 \ldots a_k$,
for some $k \geq 1$, is accepted by final state by $M_1$, then $M_1$ accepts $w$ immediately
after applying the transition function $\delta$ which has the rightmost input symbol $a_k$ as
its second argument (see [9, page 265, Exercise 10.7]).


Notice, however, that there are deterministic context-free languages which are
accepted by final state by deterministic pda’s which make $\varepsilon$-moves on the input, but
they are not accepted by final state by any deterministic pda which cannot make
$\varepsilon$-moves on the input [9, page 265, Exercise 10.6].

If $\varepsilon$-moves on the input are necessary for the acceptance by final state of a
deterministic context-free language $L$ by a deterministic pda (that is, there exists at
least one word in $L$ whose acceptance requires an $\varepsilon$-move), then by Theorem 3.3.8,
those $\varepsilon$-moves on the input are necessary only when the input string has not been
completely read [9, page 265, Exercise 10.7] (see Remark 3.1.9 on page 103).

A deterministic context-free language which is accepted by final state by a
deterministic pda which has to make $\varepsilon$-moves on the input is [9, page 265]:

\[
E_{det,\varepsilon} = \{0^i 1^k a\, 2^i \mid i, k \geq 1\} \cup \{0^i 1^k b\, 2^k \mid i, k \geq 1\}.
\]

By Theorem 3.3.8 we can construct a deterministic pda which accepts by final state
the language $E_{det,\varepsilon}$ and makes $\varepsilon$-moves on the input only when the input has not
been completely read.


DEFINITION 3.3.9. [Prefix-Free Language] A language $L$ is said to be prefix-free
(or to enjoy the prefix property) iff no string in $L$ is a proper prefix of another
string in $L$, that is, for every string $u \in L$, the string $u\,v$ for $v \neq \varepsilon$ is not in $L$.
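For a finite language the prefix property can be tested directly from the definition. The Python sketch below is a minimal illustration; the name `is_prefix_free` is an assumption, and the test only applies to finite sets of words given explicitly.

```python
def is_prefix_free(language):
    """Return True iff no word of the finite language is a proper prefix of another."""
    words = set(language)
    return not any(u != v and v.startswith(u) for u in words for v in words)
```

For instance, the set containing `()` and `()()` is not prefix-free, while the set obtained by appending an endmarker to both words is.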


THEOREM 3.3.10. [In the case of DPDA’s the Prefix Property Implies
the Equivalence of Acceptance by final state and by empty stack] A
deterministic context-free language $L$ is accepted by empty stack by a deterministic
pda iff $L$ is accepted by final state by a deterministic pda and $L$ enjoys the prefix
property.


PROOF. First, note that if the strings u and uv, with v different from €, are in 
the deterministic context-free language L, then a deterministic pushdown automaton 
which accepts L by empty stack, after reading u, should: (i) make the stack empty 
for accepting u, and also (ii) make the stack not empty for reading completely uv 
and accepting it (recall that if the stack is empty, then a pda cannot make any 
move, and the notion of acceptance of an input word by empty stack requires that 
the input word has been completely read). The remaining part of the proof is based
on the constructions indicated in the proof of Theorem 3.1.10 on page 103. □


Thus, we have the following fact. 


FACT 3.3.11. [Prefix-Free Context-Free Languages and DPDA’s] If we
add a right endmarker $\$$ to every input string of a given language $L \subseteq \Sigma^*$, with
$\$ \notin \Sigma$, then we get a language, denoted by $L\,\$$, which enjoys the prefix property,
and $L\,\$$ is accepted by a deterministic pda by final state iff $L\,\$$ is accepted by a
deterministic pda by empty stack [9, pages 121 and 248].


The reader may contrast this result with the one stated in Fact 3.3.7 on page 118.
Note that the addition of a left endmarker to a given input language does not 
increase the computational power of a deterministic pda, because its input head on 
the input tape moves to the right only. 


EXAMPLE 3.3.12. [Balanced Bracket Language] Let us consider the language
of balanced brackets, that is, the language $L(G)$ generated by the context-free
grammar $G$ with the following productions:

\[ S \to (\,) \mid (\,S\,) \mid S\,S \]

This language does not enjoy the prefix property because, for instance, both $(\,)$ and
$(\,)\,(\,)$ are words in $L(G)$. A pda accepting by empty stack the language $L(G)\,\$$ is the
deterministic pda $M$ given by the following septuple:

\[
(\{q_0\},\ \{(,\ ),\ \$\},\ \{1, Z_0\},\ q_0,\ Z_0,\ \{\},\ \delta)
\]

where the function $\delta$ is defined by the following instructions (here we assume that
the top of the stack is ‘to the left’, and thus, for instance, if we push $1\,Z_0$ onto the
stack then the new top symbol is 1):

\[
\begin{array}{llll}
q_0 & ( & Z_0 & \longrightarrow\ \text{push } 1\,Z_0 \quad \text{goto } q_0 \\
q_0 & ( & 1 & \longrightarrow\ \text{push } 1\,1 \quad \text{goto } q_0 \\
q_0 & ) & 1 & \longrightarrow\ \text{push } \varepsilon \quad \text{goto } q_0 \\
q_0 & \$ & Z_0 & \longrightarrow\ \text{push } \varepsilon \quad \text{goto } q_0
\end{array}
\]


We have that, for every non-empty word $w$, $w \in L(G)$ iff $w\,\$$ is accepted by the pda $M$.
Since the language $L(G)$ does not enjoy the prefix property, it is impossible to construct
a deterministic pda which accepts $L(G)$ by empty stack.
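The four instructions of the pda $M$ can be run directly on an input followed by the endmarker. The Python sketch below is an illustration under stated assumptions: the function name `accepts_with_endmarker` and the encoding of the stack as a list whose last element is the top are choices made here, and the endmarker is appended by the function itself.

```python
def accepts_with_endmarker(word):
    """Run the one-state dpda M on word followed by '$'; accept by empty stack."""
    stack = ['Z0']                         # the top of the stack is stack[-1]
    for symbol in word + '$':
        if not stack:
            return False                   # empty stack: no move is possible
        top = stack[-1]
        if symbol == '(' and top in ('Z0', '1'):
            stack.append('1')              # q0 ( Z0 -> push 1 Z0, q0 ( 1 -> push 1 1
        elif symbol == ')' and top == '1':
            stack.pop()                    # q0 ) 1 -> push epsilon
        elif symbol == '$' and top == 'Z0':
            stack.pop()                    # q0 $ Z0 -> push epsilon
        else:
            return False                   # no instruction applies
    return not stack                       # input read and stack empty
```

Note that, with the instructions as listed, the lone endmarker is also accepted, because the last instruction pops $Z_0$ as soon as the endmarker is read.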

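One can construct the grammar $G_1$ corresponding to $M$ as indicated in the proof
of Theorem 3.1.14. We get $G_1 = \langle \{(,\ ),\ \$\},\ \{S, [q_0\,Z_0\,q_0], [q_0\,1\,q_0]\},\ P,\ S \rangle$, where
the set $P$ of productions is the following one:

\[
\begin{array}{rcl}
S & \to & [q_0\,Z_0\,q_0] \\
{}[q_0\,Z_0\,q_0] & \to & (\ [q_0\,1\,q_0]\ [q_0\,Z_0\,q_0] \\
{}[q_0\,1\,q_0] & \to & (\ [q_0\,1\,q_0]\ [q_0\,1\,q_0] \\
{}[q_0\,1\,q_0] & \to & ) \\
{}[q_0\,Z_0\,q_0] & \to & \$
\end{array}
\]

that is, by renaming the nonterminal symbols,

\[
\begin{array}{rcl}
S & \to & A \\
A & \to & (\,B\,A \mid \$ \\
B & \to & (\,B\,B \mid\ )
\end{array}
\]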

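We have that, for every non-empty word $w$, $w \in L(G)$ iff $w\,\$ \in L(G_1)$. For instance,
for accepting by empty stack the input string $(())\$$, the pda $M$ makes the following
sequence of moves:

\[
\begin{array}{rcl}
(q_0,\ (())\$,\ Z_0) & \to_M & (q_0,\ ())\$,\ 1\,Z_0) \\
 & \to_M & (q_0,\ ))\$,\ 1\,1\,Z_0) \\
 & \to_M & (q_0,\ )\$,\ 1\,Z_0) \\
 & \to_M & (q_0,\ \$,\ Z_0) \\
 & \to_M & (q_0,\ \varepsilon,\ \varepsilon)
\end{array}
\]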


EXAMPLE 3.3.13. [Language $a^* \cup a^n b^n$] Like the language of Example 3.3.12 on
page 120, also the language $\{a^n \mid n \geq 0\} \cup \{a^n b^n \mid n \geq 1\}$ is a deterministic context-free
language which does not enjoy the prefix property.


3.4. Deterministic PDA’s and Grammars in Greibach Normal Form 


A language generated by a grammar in Greibach normal form in which there
are no two productions with the same nonterminal symbol on the left hand side
and the same leftmost terminal symbol on the right hand side, can be accepted by
a deterministic pushdown automaton and, thus, it is a deterministic context-free
language.

Note, however, that there are deterministic context-free languages such that
every grammar in Greibach normal form which generates them must have at least
two productions of the form:

\[ A \to a\,\beta_1, \qquad A \to a\,\beta_2 \]

for some $A \in V_N$, $a \in V_T$, and $\beta_1, \beta_2 \in V_N^*$, that is, there should be at least two
productions such that: (i) they have the same nonterminal symbol on the left hand
side, and (ii) they have the same leftmost terminal symbol on the right hand side.

The existence of such deterministic context-free languages follows from the fact
that when accepting a deterministic context-free language, a deterministic pushdown
automaton may be forced to make $\varepsilon$-moves when reading the input string. Indeed, if
a grammar in Greibach normal form which generates a context-free language $L$
is such that for each nonterminal $A \in V_N$ and for each terminal $a \in V_T$, there exists at
most one production of the form $A \to a\,\beta$, for some $\beta \in V_N^*$, then for every word
$w \in L$, we can construct a leftmost derivation of $w$ that generates in any derivation
step one more terminal symbol of $w$ and, thus, no $\varepsilon$-moves on the input are required
during parsing.

As already mentioned on page 120, a deterministic context-free language for
which every deterministic pushdown automaton which recognizes it is forced to
make $\varepsilon$-moves on the input is:

\[
E_{det,\varepsilon} = \{0^i 1^k a\, 2^i \mid i, k \geq 1\} \cup \{0^i 1^k b\, 2^k \mid i, k \geq 1\}.
\]

A grammar in Greibach normal form which generates the language $E_{det,\varepsilon}$ is the one
with axiom $S$ and the following productions:

\[
\begin{array}{lll}
S \to 0\,L\,T \mid 0\,R & \quad L \to 0\,L\,T \mid 1\,A & \quad R \to 0\,R \mid 1\,B\,T \\
T \to 2 & \quad A \to 1\,A \mid a & \quad B \to 1\,B\,T \mid b
\end{array}
\]

(Note that the two productions for $S$ have right hand sides which begin with the
same symbol 0.) The production $S \to 0\,L\,T$ and those for the nonterminals $L$, $A$,
and $T$ generate the language $\{0^i 1^k a\, 2^i \mid i, k \geq 1\}$, while the production $S \to 0\,R$ and
those for the nonterminals $R$, $B$, and $T$ generate the language $\{0^i 1^k b\, 2^k \mid i, k \geq 1\}$.

A deterministic pda $M$ that accepts this language by final state works as follows:

(i) first, $M$ pushes on the stack the 0’s and 1’s of the input string, and then

(ii.1) if $a$ is the next input symbol, $M$ pops off the stack all the 1’s (by making
$\varepsilon$-moves) and then checks whether or not the remaining string of the input has as
many 2’s as the 0’s on the stack, otherwise,

(ii.2) if $b$ is the next input symbol, $M$ checks whether or not the remaining string
of the input has as many 2’s as the 1’s on the stack.
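The three steps above can be rendered as a short procedure which uses an explicit stack. The Python sketch below follows the informal description, not the state diagram of the pda; the function name `accepts_e_det_eps` is an assumption, and the bottom symbol of the stack is left implicit.

```python
def accepts_e_det_eps(word):
    """Recognize { 0^i 1^k a 2^i } U { 0^i 1^k b 2^k }, for i, k >= 1."""
    stack = []                             # the top of the stack is stack[-1]
    pos = 0
    # Step (i): push the 0's and then the 1's of the input.
    for digit in '01':
        while pos < len(word) and word[pos] == digit:
            stack.append(digit)
            pos += 1
    if '0' not in stack or '1' not in stack or pos == len(word):
        return False
    marker = word[pos]
    pos += 1
    if marker == 'a':
        # Step (ii.1): pop all the 1's without reading input (epsilon-moves).
        while stack[-1] == '1':
            stack.pop()
        counted = '0'
    elif marker == 'b':
        # Step (ii.2): the 2's are matched against the 1's.
        counted = '1'
    else:
        return False
    # Match each 2 of the input against one counted symbol on the stack.
    while pos < len(word) and word[pos] == '2' and stack and stack[-1] == counted:
        stack.pop()
        pos += 1
    # Accept iff the input is finished and no counted symbol is left on top.
    return pos == len(word) and (not stack or stack[-1] != counted)
```

The inner loop of Step (ii.1) is the place where the $\varepsilon$-moves occur: the 1’s are popped while no input symbol is consumed.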

By using the conventions of Figure 3.1.3 on page 108, the pda $M$ can be
represented as in Figure 3.4.1 on page 123. Recall that $M$ accepts by final state a given
input string $w$ if $M$ enters a final state and $w$ has been completely read. The pda $M$
of Figure 3.4.1 makes $\varepsilon$-moves on the input only when the input string has not been
completely read.

Given an input word $w$ of the form $0^i 1^k a\, 2^i$, for some $i, k \geq 1$, the stack of the
pda $M$, when $M$ enters for the first time the state $q_{a2}$, has $i-1$ 0’s. Thus, the last
symbol 2 of $w$ is read exactly when the top of the stack is $Z_0$ (see the corresponding
arc leaving $q_{a2}$ in Figure 3.4.1). In the state $q_a$ we pop off the stack all the 1’s which
are on the stack.

Given an input word $w$ of the form $0^i 1^k b\, 2^k$, for some $i, k \geq 1$, the stack of
the pda $M$, when $M$ enters for the first time the state $q_b$, has $k-1$ 1’s (besides
the 0’s). Thus, the last symbol 2 of $w$ is read exactly when the top of the stack is
the topmost 0 (see the corresponding arc in Figure 3.4.1).


[Figure 3.4.1: state diagram of the deterministic pda $M$.]

FIGURE 3.4.1. The transition function of the deterministic pda $M$
that accepts by final state the language $E_{det,\varepsilon} = \{0^i 1^k a\, 2^i \mid i, k \geq 1\} \cup
\{0^i 1^k b\, 2^k \mid i, k \geq 1\}$. When pushing on the stack the string ‘$n\,m$’, the
new top of the stack is $n$. The symbols $x$ and $y$ stand for any stack symbol, but $y$
cannot be $Z_0$.


3.5. Simplifications of Context-Free Grammars 


In this section we will consider some algorithms for modifying and simplifying 
context-free grammars while preserving equivalence, that is, keeping unchanged the 
language they generate. The proof of correctness of these algorithms is left to the 
reader. 

3.5.1. Elimination of Nonterminal Symbols That Do Not Generate 
Words. 


Let us consider a context-free grammar $G = \langle V_T, V_N, P, S \rangle$. We construct an
equivalent context-free grammar $G' = \langle V_T, V_N', P', S \rangle$ such that:

(i) $V_N'$ only includes the nonterminal symbols which generate words in $V_T^*$, that is,
for all $A \in V_N'$, there exists a word $w \in V_T^*$ such that $A \to^{*}_{G} w$, and

(ii) $P'$ includes only the productions whose symbols are elements of $V_T \cup V_N'$.

The set $V_N'$ can be constructed by using the following procedure called the From-
Below Procedure.


ALGORITHM 3.5.1. From-Below Procedure.
Elimination of symbols which do not generate words.

$V_N' := \emptyset$;
do add the nonterminal symbol $A$ to $V_N'$
   if there exists a production $A \to \alpha$ with $\alpha \in (V_T \cup V_N')^*$
until no new nonterminal symbol can be added to $V_N'$

Then the set $P'$ of productions is derived by considering every production of $P$
which includes symbols in $V_T \cup V_N'$ only. In particular, if $A \in V_N'$ and $A \to \varepsilon$ is a
production of $P$, then $A \to \varepsilon$ should be included in $P'$.


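EXAMPLE 3.5.2. Given the grammar $G$ with productions:

\[
\begin{array}{rcl}
S & \to & X\,Y \mid a \\
X & \to & a
\end{array}
\]

by keeping the nonterminals which generate words, we get a new grammar whose
productions are:

\[
\begin{array}{rcl}
S & \to & a \\
X & \to & a
\end{array}
\]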


As a consequence of the From-Below Procedure we have the following decision
procedure for the emptiness of the context-free language generated by a context-free
grammar $G$:

\[ L(G) = \emptyset \quad \text{iff} \quad S \notin V_N'. \]

In general, the language which can be generated by the nonterminal $A$ (see
Definition 1.2.4 on page 11) is empty iff $A \notin V_N'$.
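The From-Below Procedure is a fixpoint computation and can be sketched in Python as follows. The encoding is an assumption made for this illustration: a production is a pair made of a left hand side and a tuple of symbols, the empty tuple stands for $\varepsilon$, and the names `from_below` and `EXAMPLE_3_5_2` are hypothetical.

```python
def from_below(terminals, productions):
    """Return the generating nonterminals and the productions which use only them."""
    def usable(rhs, generating):
        return all(s in terminals or s in generating for s in rhs)

    generating = set()
    changed = True
    while changed:                          # repeat until nothing can be added
        changed = False
        for lhs, rhs in productions:
            if lhs not in generating and usable(rhs, generating):
                generating.add(lhs)
                changed = True
    kept = [(lhs, rhs) for lhs, rhs in productions
            if lhs in generating and usable(rhs, generating)]
    return generating, kept

# The grammar of Example 3.5.2: S -> X Y | a and X -> a.
EXAMPLE_3_5_2 = [('S', ('X', 'Y')), ('S', ('a',)), ('X', ('a',))]
generating, kept = from_below({'a'}, EXAMPLE_3_5_2)
language_is_empty = 'S' not in generating   # the emptiness test stated above
```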


3.5.2. Elimination of Symbols Unreachable from the Start Symbol. 
Let us consider a context-free grammar $G = \langle V_T, V_N, P, S \rangle$. We construct an
equivalent context-free grammar $G' = \langle V_T', V_N', P', S \rangle$ such that the symbols in $V_T' \cup V_N'$
can be reached from the start symbol $S$ in the sense that for all $x \in V_T' \cup V_N'$, there
exist $\alpha, \beta \in (V_T' \cup V_N')^*$ such that $S \to^{*}_{G'} \alpha\,x\,\beta$.

The sets $V_T'$ and $V_N'$ can be constructed by using the following procedure called
the From-Above Procedure.

ALGORITHM 3.5.3. From-Above Procedure.
Elimination of symbols unreachable from the start symbol.

$V_T' := \emptyset$;
$V_N' := \{S\}$;
do add the nonterminal symbol $B$ to $V_N'$
   if there exists a production $A \to \alpha\,B\,\beta$ with $A \in V_N'$, $B \in V_N$,
   and $\alpha, \beta \in (V_T \cup V_N)^*$;

   add the terminal symbol $b$ to $V_T'$
   if there exists a production $A \to \alpha\,b\,\beta$ with $A \in V_N'$, $b \in V_T$,
   and $\alpha, \beta \in (V_T \cup V_N)^*$;

until no new nonterminal symbol can be added to $V_N'$


Then the set $P'$ of productions is derived by considering every production of $P$
which includes symbols in $V_T' \cup V_N'$ only. In particular, if $A \in V_N'$ and $A \to \varepsilon$ is a
production of $P$, then $A \to \varepsilon$ should be included in $P'$.

EXAMPLE 3.5.4. Let us consider the same grammar $G$ of Example 3.5.2, that is:

\[
\begin{array}{rcl}
S & \to & X\,Y \mid a \\
X & \to & a
\end{array}
\]

If we first keep the nonterminals which generate words, we get (see Example 3.5.2
on page 124) the two productions $S \to a$ and $X \to a$ and then, if we keep only
the symbols reachable from $S$, we get the production:

\[ S \to a \]

Note that if, given the initial grammar $G$, we first keep only the symbols reachable
from $S$ (which are the nonterminals $S$, $X$, and $Y$, and the terminal $a$) we get the
same grammar $G$ and then, by keeping the nonterminals which generate words, we
get the two productions $S \to a$ and $X \to a$, where the symbol $X$ is useless (see
Definition 3.5.5 on page 125).
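The From-Above Procedure of Algorithm 3.5.3 can be sketched in the same style. As before, the encoding is an assumption of this illustration: productions are pairs made of a left hand side and a tuple of symbols, and the name `from_above` is hypothetical.

```python
def from_above(terminals, nonterminals, productions, start):
    """Return the reachable terminals, the reachable nonterminals,
    and the productions whose left hand side is reachable."""
    reachable_n, reachable_t = {start}, set()
    changed = True
    while changed:                          # repeat until no nonterminal is added
        changed = False
        for lhs, rhs in productions:
            if lhs not in reachable_n:
                continue
            for symbol in rhs:
                if symbol in nonterminals and symbol not in reachable_n:
                    reachable_n.add(symbol)
                    changed = True
                elif symbol in terminals:
                    reachable_t.add(symbol)
    kept = [(lhs, rhs) for lhs, rhs in productions if lhs in reachable_n]
    return reachable_t, reachable_n, kept

# After the From-Below step of Example 3.5.4: S -> a and X -> a.
AFTER_FROM_BELOW = [('S', ('a',)), ('X', ('a',))]
# The initial grammar of Example 3.5.4: S -> X Y | a and X -> a.
INITIAL_GRAMMAR = [('S', ('X', 'Y')), ('S', ('a',)), ('X', ('a',))]
```

Applied to `AFTER_FROM_BELOW` the sketch keeps only the production for $S$, while applied to `INITIAL_GRAMMAR` it keeps every production, which matches the two orders of application discussed in the example.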


Example 3.5.4 above shows that, in order to simplify context-free grammars it 
is important to: first, (i) eliminate the nonterminal symbols which do not generate 
words by applying the From-Below Procedure, and then (ii) eliminate the symbols 
which are unreachable from the start symbol by applying the From-Above Proce- 
dure. 


Now we state an important property of the From-Below and From-Above pro- 
cedures we have presented above. 


DEFINITION 3.5.5. [Useful Symbols and Useless Symbols] Given a grammar
$G = \langle V_T, V_N, P, S \rangle$, a symbol $X \in V_T \cup V_N$ is useful iff $S \to^{*}_{G} \alpha\,X\,\beta \to^{*}_{G} w$ for
some $\alpha, \beta \in (V_T \cup V_N)^*$ and $w \in V_T^*$. A symbol is useless iff it is not useful.


THEOREM 3.5.6. [Elimination of Useless Symbols] Given a context-free
grammar $G = \langle V_T, V_N, P, S \rangle$, by applying first the From-Below Procedure and then
the From-Above Procedure we get an equivalent grammar without useless symbols.


Further simplifications of the context-free grammars are possible. Now we will indi- 
cate three more simplifications: (i) elimination of epsilon productions, (ii) elimina- 
tion of unit productions, and (iii) elimination of left recursion. 


3.5.3. Elimination of Epsilon Productions. 


In this section we prove Theorem 1.5.4 (iii) which we stated on page 20. We recall 
it here for the reader’s convenience: 


(iii) For every extended context-free grammar $G$ such that $\varepsilon \notin L(G)$, there exists
an equivalent context-free grammar $G'$ without $\varepsilon$-productions. For every extended
context-free grammar $G$ such that $\varepsilon \in L(G)$, there exists an equivalent, S-extended
context-free grammar $G'$.

The proof of that theorem is provided by the correctness of the following
Algorithm 3.5.8. Recall that:
(i) an extended context-free grammar is a context-free grammar where we also allow
one or more productions of the form: $A \to \varepsilon$ for some $A \in V_N$, and
(ii) an S-extended context-free grammar is a context-free grammar where we also
allow a production of the form: $S \to \varepsilon$.


Let us first introduce the following definition. 


DEFINITION 3.5.7. [Nullable Nonterminal] Given a grammar $G$, a nonterminal
symbol $A$ is said to be nullable if $A \to^{*}_{G} \varepsilon$.


Given an extended context-free grammar $G = \langle V_T, V_N, P, S \rangle$ we get the equivalent
S-extended context-free grammar by applying the following procedure. In the
derived S-extended grammar we have the production $S \to \varepsilon$ iff $\varepsilon \in L(G)$.


ALGORITHM 3.5.8. Procedure: Elimination of $\varepsilon$-productions (different from the
production $S \to \varepsilon$).
Step (1). Construct the set of nullable symbols by applying the following two rules
until no new symbols can be declared as nullable:

(1.1) if $A \to \varepsilon$ is a production in $P$ then $A$ is nullable,

(1.2) if $B \to \alpha$ is a production in $P$ and all symbols in $\alpha$ are nullable
then $B$ is nullable.

Step (2). If $S$ is nullable then add the production $S \to \varepsilon$.
Step (3). Replace each production $A \to x_1 \ldots x_n$, for any $n > 0$, by all productions
of the form: $A \to y_1 \ldots y_n$, where:

(3.1) ($y_i = x_i$ or $y_i = \varepsilon$) for every $x_i$ in $\{x_1, \ldots, x_n\}$ which is nullable, and

(3.2) $y_i = x_i$ for every $x_i$ in $\{x_1, \ldots, x_n\}$ which is not nullable.
Step (4). Delete all $\varepsilon$-productions, but keep the production $S \to \varepsilon$, if it was
introduced at Step (2).


Note that after the elimination of $\varepsilon$-productions, some useless symbols may be
generated as shown by the following example.


EXAMPLE 3.5.9. Let us consider the grammar with the following productions:

\[
\begin{array}{rcl}
S & \to & A \\
A & \to & \varepsilon
\end{array}
\]

In this grammar no symbol is useless. After the elimination of the $\varepsilon$-productions we
get the grammar with productions:

\[
\begin{array}{rcl}
S & \to & A \\
S & \to & \varepsilon
\end{array}
\]

where the symbol $A$ is useless and it can be eliminated by applying the From-Below
Procedure.
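The four steps of Algorithm 3.5.8 can be sketched in Python as follows. This is an illustration under stated assumptions: productions are pairs made of a left hand side and a tuple of symbols, the empty tuple stands for $\varepsilon$, the result is returned as a set of productions, and the names `eliminate_epsilon`, `EXAMPLE_3_5_9`, and `SECOND_EXAMPLE` are hypothetical.

```python
from itertools import product

def eliminate_epsilon(productions, start):
    """Return the nullable nonterminals and the new set of productions."""
    # Step (1): compute the nullable symbols as a fixpoint.
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in productions:
            if lhs not in nullable and all(s in nullable for s in rhs):
                nullable.add(lhs)            # rhs == () covers rule (1.1)
                changed = True
    result = set()
    # Step (3): each nullable symbol of a right hand side is kept or dropped.
    for lhs, rhs in productions:
        choices = [((s,), ()) if s in nullable else ((s,),) for s in rhs]
        for combination in product(*choices):
            new_rhs = tuple(s for part in combination for s in part)
            if new_rhs:                      # Step (4): drop epsilon-productions
                result.add((lhs, new_rhs))
    # Step (2): keep S -> epsilon iff the start symbol is nullable.
    if start in nullable:
        result.add((start, ()))
    return nullable, result

# The grammar of Example 3.5.9: S -> A and A -> epsilon.
EXAMPLE_3_5_9 = [('S', ('A',)), ('A', ())]
# A second, hypothetical grammar: S -> A b B, A -> a | epsilon, B -> b | epsilon.
SECOND_EXAMPLE = [('S', ('A', 'b', 'B')), ('A', ('a',)), ('A', ()),
                  ('B', ('b',)), ('B', ())]
```

On the grammar of Example 3.5.9 the sketch returns the two productions $S \to A$ and $S \to \varepsilon$, so that $A$ is left without productions, as observed in the example.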


3.5.4. Elimination of Unit Productions. 
We first introduce the notion of a unit production. 
DEFINITION 3.5.10. [Unit Production and Trivial Unit Production] Given
a context-free grammar $G = \langle V_T, V_N, P, S \rangle$, a production of the form $A \to B$ for
some $A, B \in V_N$, not necessarily distinct, is said to be a unit production. A unit
production is said to be a trivial unit production if it is of the form $A \to A$, for
some $A \in V_N$.


Let us consider a context-free grammar $G = \langle V_T, V_N, P, S \rangle$ without $\varepsilon$-productions.
We want to construct an equivalent context-free grammar $G' = \langle V_T, V_N, P', S \rangle$
without unit productions.

The set $P'$ consists of all non-unit productions of $P$ together with all productions
of the form $A \to \alpha$, if $A \to^{*} B$ via unit productions and $B \to \alpha$ with $|\alpha| > 1$ or
$\alpha \in V_T$.

One can show that the construction of the set P’ can be done by applying the 
following procedure which starting from the set P, generates a sequence of sets of 
productions, the last of which is P’. 


ALGORITHM 3.5.11. Procedure: Elimination of unit productions. 


Let G = ⟨V_T, V_N, P, S⟩ be the given context-free grammar without ε-productions.
We will derive an equivalent context-free grammar G′ = ⟨V_T, V_N, P′, S⟩ without
ε-productions and without unit productions.


Step (1). We modify the set P of productions by discarding all trivial unit
productions. Then we consider a first-in-first-out queue U of unit productions,
initialized by the non-trivial unit productions of P in any order. Then we modify
the set P of productions and we modify the queue U by performing as long as
possible the following Step (2).

Step (2). We extract from the queue U a unit production. It will be of the form
A → B, with A, B ∈ V_N and A different from B.


(2.1) We unfold B in A → B, that is, we replace in P the production A → B
      by the productions A → β_1 | ... | β_n, where B → β_1 | ... | β_n are all
      the productions for B.

(2.2) Then we discard from P all trivial unit productions.

(2.3) We insert in the queue U, one after the other, in any order, all the
      non-trivial unit productions, if any, which have been generated by the
      unfolding Step (2.1).
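The queue-based procedure above can be sketched in Python as follows. This is only
an illustrative sketch under some assumptions of ours: every symbol is a single
character, a grammar is a dictionary from nonterminals to lists of right-hand
sides, and the function name eliminate_unit_productions is hypothetical. We also
assume that a unit production counts as "generated" at Step (2.1) only if it is not
already in P, and that a queue entry which is no longer in P is skipped.

```python
from collections import deque


def eliminate_unit_productions(productions, initial_queue=None):
    """Eliminate unit productions with a first-in-first-out queue.

    `productions` maps each nonterminal (one character) to a list of
    right-hand sides, each a string of one-character symbols.
    The grammar is assumed to have no epsilon-productions.
    """
    nonterminals = set(productions)

    def is_unit(rhs):
        return len(rhs) == 1 and rhs in nonterminals

    # Step (1): discard trivial unit productions and initialize the queue U.
    P = {A: [r for r in dict.fromkeys(rhss) if r != A]
         for A, rhss in productions.items()}
    if initial_queue is None:
        initial_queue = [(A, r) for A, rhss in P.items()
                         for r in rhss if is_unit(r)]
    queue = deque(initial_queue)

    # Step (2): extract unit productions as long as possible.
    while queue:
        A, B = queue.popleft()
        if B not in P[A]:
            continue  # assumption: the entry is no longer a production of P
        # (2.1) unfold B in A -> B
        P[A].remove(B)
        for rhs in list(P[B]):
            # (2.2) a trivial unit production A -> A is discarded
            if rhs == A or rhs in P[A]:
                continue
            P[A].append(rhs)
            # (2.3) enqueue the newly generated non-trivial unit productions
            if is_unit(rhs):
                queue.append((A, rhs))
    return P
```

For instance, calling the function on {"S": ["AS", "A"], "A": ["a", "B"],
"B": ["b", "S", "A"]} leaves every nonterminal with the right-hand sides AS, a,
and b.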


Note that after the elimination of the unit productions, some useless symbols may 
be generated as the following example shows. 


EXAMPLE 3.5.12. Let us consider the grammar with the productions:

    S → AS | A
    A → a | B
    B → b | S | A

In this grammar there are no useless symbols. Let us assume that initially the
queue U is [A → B, S → A, B → S, B → A]. The first production we extract from
the queue (assuming that an element is inserted in the queue ‘from the right’ and is
extracted ‘from the left’) is: A → B. Thus, we perform Step (2) by first unfolding
B in A → B. At the end of Step (2) we get:

    S → AS | A
    A → a | b | S
    B → b | S | A

Note that we have discarded the production A → A. Since the new unit production
A → S has been generated, we get the new queue [S → A, B → S, B → A, A → S].
Then we extract the production S → A. After a new execution of Step (2) we get:

    S → AS | a | b
    A → a | b | S
    B → b | S | A

and the new queue [B → S, B → A, A → S]. We extract B → S from the queue
and after unfolding S in B → S, we get:

    S → AS | a | b
    A → a | b | S
    B → b | AS | a | A

and the new queue [B → A, A → S]. We extract B → A from the queue and after
unfolding A in B → A, we get:

    S → AS | a | b
    A → a | b | S
    B → b | AS | a | S

and the new queue [A → S, B → S], because the new non-trivial unit production
B → S has been generated. We extract A → S from the queue and after unfolding
S in A → S, we get:

    S → AS | a | b
    A → a | b | AS
    B → b | AS | a | S

and the new queue [B → S]. We extract B → S from the queue and we get (after
rearrangement of the productions):

    S → AS | a | b
    A → AS | a | b
    B → AS | a | b

The queue is now empty and the procedure terminates. In this final grammar
without unit productions the symbol B is useless and we can eliminate it, together
with the three productions for B, by applying the From-Above Procedure. □


REMARK 3.5.13. The use of a stack, instead of a queue, in Algorithm 3.5.11 on
page 127 for the elimination of unit productions, is not correct. This can be shown
by considering the grammar with the following productions and axiom S:

    S → a | A
    A → B | b
    B → A | a

and considering the initial stack [S → A, A → B, B → A] with top item S → A. □


REMARK 3.5.14. When eliminating the unit productions from a given extended
context-free grammar G, we should start from a grammar without ε-productions,
that is, we have first to eliminate from the grammar G the ε-productions and then
in the derived grammar, call it G′, we have to eliminate the unit productions by
considering aside the production S → ε, which is present in the grammar G′ iff
ε ∈ L(G). If we do not do so and we do not consider the production S → ε aside,
we may end up in an endless loop. This is shown by the following example.


EXAMPLE 3.5.15. Let us consider the grammar with the following set P of
productions and axiom S:

    P:   S → AS | a | ε
         A → SA | a | ε

We first eliminate the ε-productions and we get the following set P1 of productions:

    P1:  S → AS | A | S | a | ε
         A → SA | A | S | a

Then we eliminate the unit productions, but we do not keep aside the production
S → ε. Thus, we do not start from the productions:

    S → AS | A | S | a
    A → SA | A | S | a

but, indeed, we apply the procedure for eliminating unit productions starting from
the set P1 of productions. We get the following productions:

    S → AS | SA | a | ε
    A → SA | AS | a | ε

This set of productions includes the initial set P of productions: we are in an endless
loop.


3.5.5. Elimination of Left Recursion. 


Let us consider a context-free grammar G = ⟨V_T, V_N, P, S⟩ without ε-productions
and without unit productions. We want to construct an equivalent context-free
grammar G′ = ⟨V_T, V′_N, P′, S⟩ such that in P′ there are no left recursive
productions (see Definition 1.6.5 on page 27).

The construction of the set P’ can be done by applying the following procedure. 


ALGORITHM 3.5.16. Procedure: Elimination of left recursion. 


Let G = ⟨V_T, V_N, P, S⟩ be the given context-free grammar without ε-productions
and without unit productions. We derive an equivalent context-free grammar G′ =
⟨V_T, V′_N, P′, S⟩ without left recursive productions.

For every nonterminal A for which there is a left recursive production, do the
following two steps.

Step (1). Consider all the productions with A in the left hand side. Let them be:

    A → A α_1 | ... | A α_n     (left recursive productions for A)
    A → β_1 | ... | β_m         (non-left recursive productions for A)

Step (2). Add a new nonterminal symbol B and replace all the productions whose
left hand side is A, by the following ones:

    A → β_1 | ... | β_m         (non-left recursive productions for A)
    A → β_1 B | ... | β_m B     (productions for A involving B)
    B → α_1 | ... | α_n         (non-right recursive productions for B)
    B → α_1 B | ... | α_n B     (right recursive productions for B)
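Steps (1) and (2) for a single nonterminal can be sketched as follows. This is a
minimal sketch which assumes one-character symbols and right-hand sides written as
strings; the function name eliminate_left_recursion and the parameter fresh (the
new nonterminal B of Step (2), chosen by the caller) are our own, hypothetical
choices.

```python
def eliminate_left_recursion(productions, A, fresh):
    """Apply Steps (1) and (2) to the nonterminal A.

    `productions` maps nonterminals (one character each) to lists of
    right-hand sides written as strings; `fresh` is a nonterminal that
    does not occur in the grammar. The grammar is assumed to have no
    epsilon-productions and no unit productions.
    """
    # Step (1): split the productions for A into A -> A alpha and A -> beta.
    alphas = [rhs[1:] for rhs in productions[A] if rhs.startswith(A)]
    betas = [rhs for rhs in productions[A] if not rhs.startswith(A)]
    result = dict(productions)
    if not alphas:
        return result  # A has no left recursive production
    # Step (2): replace the productions for A and add those for `fresh`.
    result[A] = betas + [beta + fresh for beta in betas]
    result[fresh] = alphas + [alpha + fresh for alpha in alphas]
    return result
```

For the productions S → SA | a, A → a with the new nonterminal Z, the function
returns S → a | aZ and Z → A | AZ, leaving A → a unchanged.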


Note that after the elimination of left recursion according to this procedure, some 
unit productions may be generated as shown by the following example. 


EXAMPLE 3.5.17. Let us consider the grammar with the following set of
productions and axiom S:

    S → SA | a
    A → a

After the elimination of left recursion we get the following set of productions:

    S → a | aZ
    Z → A | AZ
    A → a

Then by eliminating the unit production Z → A, we get the set of productions:

    S → a | aZ
    Z → a | AZ
    A → a


The correctness of the above Algorithm 3.5.16 follows from the Arden rule. We
present the basic idea of that correctness proof through the following example where
that algorithm is applied in the case n = m = 1, α_1 = b, and β_1 = a.


EXAMPLE 3.5.18. Let us consider the following two productions: A → Ab | a.
By the Arden rule the language produced by A is given by the regular expression
ab*. Now ab* can be generated from A by using two productions corresponding to
the two summands of the regular expression: a + ab⁺ (which is equal to ab*). We
need to introduce a new nonterminal symbol, say B, which generates the words in b⁺.
Thus, we have:

    A → a | aB     (A generates the words in a + ab⁺)
    B → b | bB     (B generates the words in b⁺)                              □
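The argument of this example can be written as a short chain of equalities between
regular expressions. This is only a sketch: we assume the Arden rule in the form
which solves the equation A = Ab + a as A = ab*, and we read the nonterminals A
and B as the languages they generate.

```latex
\begin{align*}
A &= A\,b + a            && \text{(equation for } A \to Ab \mid a\text{)}\\
A &= a\,b^{*}            && \text{(Arden rule)}\\
  &= a\,(\varepsilon + b^{+}) \;=\; a + a\,b^{+} \\
B &= b^{+} \;=\; b + b\,b^{+} \;=\; b + b\,B
                         && \text{(hence } B \to b \mid bB\text{)}\\
A &= a + a\,B            && \text{(hence } A \to a \mid aB\text{)}
\end{align*}
```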


In the literature we have the following strong notion of a left recursive context-free 
grammar which should not be confused with the one of Definition 1.6.5 on page 27. 


DEFINITION 3.5.19. [Left Recursive Context-Free Grammar. Strong
Version] A context-free grammar G = ⟨V_T, V_N, P, S⟩ is said to be left recursive
if there exists a nonterminal symbol A such that S →*_G αAβ and A →⁺_G Aγ, for
some α, β, γ ∈ (V_T ∪ V_N)*.


3.6. Construction of the Chomsky Normal Form

In this section we show that every extended context-free grammar G has an
equivalent context-free grammar G′ in Chomsky normal form which we now define in
the case where the grammar G has epsilon productions.


DEFINITION 3.6.1. [Chomsky Normal Form. Version with Epsilon
Productions] An extended context-free grammar G is said to be in Chomsky normal
form if its productions are of the form:

    A → BC     for A, B, C ∈ V_N, or
    A → a      for A ∈ V_N and a ∈ V_T, and

if ε ∈ L(G) then (i) the set of productions of G includes also the production S → ε,
and (ii) S does not occur on the right hand side of any production [1].
If ε ∉ L(G), as we assume in the proofs of Theorem 3.11.1 on page 150 and
Theorem 3.14.2 on page 159, then S may occur on the right hand side of the
productions.


THEOREM 3.6.2. [Chomsky Theorem. Version with Epsilon Productions]
Every extended context-free grammar G has an equivalent S-extended context-free
grammar G′ in Chomsky normal form.


PROOF. It is based on: (i) the procedure for eliminating the ε-productions (see
Algorithm 3.5.8 on page 126), followed by (ii) the procedure for eliminating the
unit productions (see Algorithm 3.5.11 on page 127), and by (iii) the procedure for
putting a grammar in Kuroda normal form (see the proof of Theorem 1.3.11 on
page 17).

The proof of this Theorem 3.6.2 justifies the algorithm for constructing the 
Chomsky normal form of an extended context-free grammar which we now present. 
This algorithm is correct even if the axiom S of the given grammar G occurs in 
the right hand side of some production of G. Recall, however, that without loss of 
generality, by Theorem 1.3.6 on page 16 we may assume that the axiom S does not 
occur in the right hand side of any production of G. 


ALGORITHM 3.6.3.
Procedure: from an extended context-free grammar G = ⟨V_T, V_N, P, S⟩ to an
equivalent context-free grammar G′ = ⟨V_T, V′_N, P′, S⟩ in Chomsky normal form.


Step (1). Simplify the grammar. Transform the given grammar G by:

(i) eliminating ε-productions, with the possible exception of S → ε iff ε ∈ L(G),
and

(ii) eliminating unit productions.

(The elimination of useless symbols is not necessary.) Let the derived grammar G^s
be ⟨V_T, V_N^s, P^s, S⟩. We have that S → ε ∈ P^s iff ε ∈ L(G).

Let us consider: (i) a set W of nonterminal symbols initialized to V_N^s, and (ii) a
set R of productions initialized to P^s − {S → ε}.

Step (2). Reduce the order of the productions. In the set R of productions replace as
long as possible every production of the form: A → x_1 x_2 α, with A ∈ V_N,
x_1, x_2 ∈ V_T ∪ V_N, and α ∈ (V_T ∪ V_N)⁺, by the two productions:

    A → x_1 B
    B → x_2 α

where B is a new nonterminal symbol which is added to W.
Note that any such replacement reduces the order of a production (see
Definition 1.3.10 on page 17) by at least one unit.


Step (3). Promote the terminal symbols. In every production of the form: A → BC
with A ∈ V_N and B, C ∈ (V_T ∪ V_N), (i) replace every terminal symbol f occurring
in BC by a new nonterminal symbol F, (ii) add F to W, and (iii) add the production
F → f to R.


The set V′_N of nonterminal symbols and the set P′ of productions we want to
construct, are defined in terms of the final values of the sets W and R as follows:

    V′_N = W, and
    P′ = if ε ∈ L(G) then R ∪ {S → ε} else R.
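Steps (2) and (3) can be sketched in Python as follows. The sketch assumes that
Step (1) has already been performed and that the production S → ε, if any, has been
set aside. The function name chomsky_steps and the names N1, N2, ... for the new
nonterminals are hypothetical choices of ours. Unlike the example that follows, the
sketch introduces a new nonterminal for every replacement and does not try to share
nonterminals among productions.

```python
from itertools import count


def chomsky_steps(productions, terminals):
    """Steps (2) and (3) for a grammar without epsilon- and unit productions.

    `productions` maps each nonterminal to a list of right-hand sides,
    each a tuple of symbols; `terminals` is the set of terminal symbols.
    Returns the final set R as a list of (left-hand side, right-hand side).
    """
    W = set(productions)
    numbers = count(1)

    def fresh():
        # Hypothetical naming scheme for the new nonterminals: N1, N2, ...
        while True:
            name = f"N{next(numbers)}"
            if name not in W and name not in terminals:
                W.add(name)
                return name

    # Step (2): reduce the order of the productions.
    todo = [(A, tuple(rhs)) for A, rhss in productions.items() for rhs in rhss]
    R = []
    while todo:
        A, rhs = todo.pop()
        if len(rhs) > 2:
            B = fresh()
            R.append((A, (rhs[0], B)))   # A -> x1 B
            todo.append((B, rhs[1:]))    # B -> x2 alpha, possibly split again
        else:
            R.append((A, rhs))

    # Step (3): promote the terminal symbols in right-hand sides of length 2.
    promoted = {}
    result = []
    for A, rhs in R:
        if len(rhs) == 2:
            new_rhs = []
            for x in rhs:
                if x in terminals:
                    if x not in promoted:
                        promoted[x] = fresh()
                    x = promoted[x]
                new_rhs.append(x)
            rhs = tuple(new_rhs)
        result.append((A, rhs))
    result.extend((F, (f,)) for f, F in promoted.items())
    return result
```

Every production returned has either two nonterminals or a single terminal on its
right hand side.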


EXAMPLE 3.6.4. Let us consider the grammar with the following productions
and axiom E:

    E → E + T | T
    T → T × F | F
    F → (E) | a

Note that the axiom E does occur on the right hand side of a production. There are
no ε-productions, but there are unit productions. After the elimination of the unit
productions (it is not necessary to perform the elimination of the left recursion), we
get:

    E → E + T | T × F | (E) | a
    T → T × F | (E) | a
    F → (E) | a

Then we apply Step (2) of our Algorithm 3.6.3 for deriving the equivalent grammar
in Chomsky normal form. For instance, we replace E → E + T by:

    E → EA     A → PT     P → +

where we have introduced the new nonterminal symbols A and P. By continuing this
replacement process we get the following equivalent grammar in Chomsky normal
form:

    E → EA | TB | LC | a
    A → PT
    P → +
    T → TB | LC | a
    B → MF
    M → ×
    F → LC | a
    C → ER
    L → (
    R → )


3.7. Construction of the Greibach Normal Form


In this section we prove Theorem 1.4.4 on page 19. We show that every extended
context-free grammar G = ⟨V_T, V_N, P, S⟩ has an equivalent context-free grammar
G′ in Greibach normal form which we now define in the case where the grammar G
has epsilon productions.


DEFINITION 3.7.1. [Greibach Normal Form. Version with Epsilon
Productions] An extended context-free grammar G = ⟨V_T, V_N, P, S⟩ is said to be in
Greibach normal form if its productions are of the form:

    A → aα     for A ∈ V_N, a ∈ V_T, α ∈ V_N*, and

if ε ∈ L(G) then the set of productions of G includes also the production S → ε.


We do not insist, as some other authors do (see, for instance, [1, pages 270, 272]),
that if S → ε is a production of the grammar in Greibach normal form, then the
axiom S does not occur on the right hand side of any production. Indeed, if S
occurs on the right hand side of some production then we can always construct an
equivalent grammar in Greibach normal form where S does not occur on the right
hand side of any production.


THEOREM 3.7.2. [Greibach Theorem. Version with Epsilon Productions]
Every extended context-free grammar G = ⟨V_T, V_N, P, S⟩ has an equivalent
S-extended context-free grammar G′ = ⟨V_T, V′_N, P′, S⟩ in Greibach normal form.


The proof of this theorem is based on the following procedure for constructing
the sets V′_N and P′. This procedure is correct even if the axiom S of the given
grammar G occurs in the right hand side of some production of G. Recall, however,
that without loss of generality, by Theorem 1.3.6 on page 16 we may assume that
the axiom S does not occur in the right hand side of any production of G.


ALGORITHM 3.7.3.
Procedure: from an extended context-free grammar G = ⟨V_T, V_N, P, S⟩ to an
equivalent context-free grammar G′ = ⟨V_T, V′_N, P′, S⟩ in Greibach normal form.
(Version 1)


Step (1). Simplify the grammar. Transform the given grammar G by:
(i) eliminating ε-productions, with the possible exception of S → ε iff ε ∈ L(G),
and
(ii) eliminating unit productions.
(The elimination of useless symbols is not necessary.) Let the derived grammar G^s
be ⟨V_T, V_N^s, P^s, S⟩. We have that S → ε ∈ P^s iff ε ∈ L(G).

Step (2). Draw the dependency graph. Let us consider a directed graph D, called the
dependency graph, whose set of nodes is V_N^s and whose set of arcs is:

    {A_i → A_j | A_i, A_j ∈ V_N^s and A_i → A_j γ ∈ P^s for some γ ∈ (V_T ∪ V_N^s)⁺}.
Step (3). Break the self-loops and the loops. Let us consider: (i) a set W of
nonterminal symbols initialized to V_N^s, and (ii) a set R of productions initialized
to P^s − {S → ε}.

For each loop in D of the form A_0 → A_1 → ... → A_n → A_0, for n ≥ 0, starting
from the self-loops, that is, the loops of the form A_0 → A_0 (in which case we have
that n = 0 and Steps (3.1), (3.2), and (3.3) require no action), do the following
steps, where we assume that γ stands for any string in (V_T ∪ W)*:


(3.1) unfold A_1 with respect to R in all productions of R of the form A_0 → A_1 γ,
      thereby updating R, and

(3.2) unfold A_2 with respect to R in all productions of R of the form A_0 → A_2 γ,
      thereby updating R, and
      ..., and

(3.3) unfold A_n with respect to R in all productions of R of the form A_0 → A_n γ,
      thereby updating R, and

(3.4) eliminate left recursion in the productions of A_0 (we do so by applying
      Algorithm 3.5.16, thereby updating R, and when that algorithm is applied,
      one has to choose a fresh, new nonterminal symbol which is added to the
      set W), and

(3.5) update the graph D as follows:
      (i) if n = 0 then erase the arc A_0 → A_0, and
      (ii) if n > 0 then erase the arc A_0 → A_1, and
      (iii) if the new nonterminal symbol chosen at Step (3.4) is Z and in R there
      is a production of the form Z → Aγ, for some A ∈ W, then add to the
      graph D the node Z and the arc Z → A.


Step (4). Go upwards from the leaves. For every arc A_i → A_j in D such that A_i
and A_j belong to W and A_j is a leaf of D (that is, it has no outgoing arcs), do the
following steps:
(4.1) unfold A_j with respect to R in all productions of R of the form A_i → A_j γ,
      for some γ ∈ (V_T ∪ W)*, thereby updating R, and
(4.2) erase the arc A_i → A_j and erase also the node A_j if it has no incoming arcs.


Step (5). Promote the intermediate terminal symbols. In every production of the
form: V_i → aγ with a ∈ V_T and γ ∈ (V_T ∪ W)*, (i) replace every terminal symbol f
occurring in γ by a new nonterminal symbol F, (ii) add F to W, and (iii) add the
production F → f to R.


The set V′_N of nonterminal symbols and the set P′ of productions we want to
construct, are defined in terms of the final values of the sets W and R as follows:

    V′_N = W, and
    P′ = if ε ∈ L(G) then R ∪ {S → ε} else R.
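As an illustration of Step (2) only, the dependency graph can be computed from
the productions as in the following sketch. The representation (right-hand sides as
tuples of symbols) and the function names dependency_arcs, self_loops, and leaves
are assumptions of ours and are not part of the algorithm.

```python
def dependency_arcs(productions):
    """Arcs A_i -> A_j of the dependency graph of Step (2).

    `productions` maps each nonterminal to a list of right-hand sides,
    each a tuple of symbols. There is an arc (A_i, A_j) whenever some
    production A_i -> A_j gamma exists with gamma non-empty.
    """
    nonterminals = set(productions)
    arcs = set()
    for A, rhss in productions.items():
        for rhs in rhss:
            if len(rhs) >= 2 and rhs[0] in nonterminals:
                arcs.add((A, rhs[0]))
    return arcs


def self_loops(arcs):
    """Nodes with a self-loop, to be considered first at Step (3)."""
    return {a for a, b in arcs if a == b}


def leaves(productions, arcs):
    """Nodes without outgoing arcs, from which Step (4) starts."""
    return set(productions) - {a for a, _ in arcs}
```

For the productions S′ → AS | SA | a | b, S → AS | SA | a | b, A → SA | b the arcs
are S′ → A, S′ → S, S → A, S → S, and A → S, the only self-loop is on S, and
there are no leaves.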


Now we make a few remarks on the above Algorithm 3.7.3. 


REMARK 3.7.4. (i) The updating of the dependency graph D at the end of
Step (3.5) never generates new loops in D. Thus, at the end of Step (3) the graph
D does not contain loops.

(ii) At Step (3) the loops with n ≠ 0 can be considered in any order, while the
self-loops should be considered first.

(iii) Step (5) is similar to the step required for constructing the separated form of
a given grammar and, similarly to the Chomsky normal form, also in the Greibach
normal form each terminal symbol is generated by a nonterminal symbol.


EXAMPLE 3.7.5. Let us consider the following grammar with axiom S:

    S → AS | a | ε
    A → SA | b

We start by eliminating the occurrence of the axiom S on the right hand side of
the productions. This transformation is not actually needed for the construction of
the Greibach normal form of the given grammar, but we do it anyway (see what we
have said after Definition 3.7.1 on page 133).

We introduce the new axiom S′ and we get:

    S′ → S
    S → AS | a | ε
    A → SA | b

Then, in the derived grammar we eliminate the ε-productions and we get:

    S′ → S | ε
    S → AS | A | a
    A → SA | A | b

We consider the production S′ → ε aside, and we construct the Greibach normal
form of the grammar:

    S′ → S
    S → AS | A | a
    A → SA | A | b

We eliminate the trivial unit production A → A. Then we eliminate the unit
production S′ → S, and by unfolding S in the production S′ → S we get:

    S′ → AS | A | a
    S → AS | A | a
    A → SA | b

By unfolding A in S′ → A and in S → A, we get:

    S′ → AS | SA | a | b     (α)
    S → AS | SA | a | b
    A → SA | b


Now we perform Steps (2) and (3) of the Algorithm 3.7.3. We have the following
dependency graph D:

    [Figure: dependency graph D with nodes S′, S, and A, and arcs
     S′ → S, S′ → A, S → S, S → A, and A → S.]

We first break the self-loop of S due to the production S → SA. By applying
Algorithm 3.5.16 (Elimination of Left Recursion) we replace the productions for S,
that is:

    S → AS | SA | a | b

by the following productions:

    S → AS | a | b | ASZ | aZ | bZ     (β)
    Z → A | AZ                         (γ)

where Z is a new nonterminal. We have the new dependency graph:

    [Figure: dependency graph with nodes S′, S, A, and Z, and arcs
     S′ → S, S′ → A, S → A, A → S, and Z → A.]

We break the loop A → S → A from A to A in the dependency graph by unfolding
S in A → SA | b and we get:

    A → ASA | aA | bA | ASZA | aZA | bZA | b

Then, by eliminating the left recursion for the nonterminal symbol A, we get:

    A → aA | bA | aZA | bZA | b
      | aAY | bAY | aZAY | bZAY | bY
    Y → SA | SZA | SAY | SZAY          (δ)

where Y is a new nonterminal. We get the new dependency graph without self-loops
or loops:

    [Figure: dependency graph with nodes S′, S, A, Y, and Z, and arcs
     S′ → S, S′ → A, S → A, Y → S, and Z → A.]

Now we apply Step (4) of Algorithm 3.7.3. First, (i) we have to unfold A in the
leftmost positions of the productions (α), (β), and (γ), and then (ii) we have to
unfold S in the productions S′ → SA and (δ). We leave these unfolding steps to
the reader. After these steps one gets the desired grammar in Greibach normal form
which, for brevity reasons, we do not list here. Note that in our case Step (5) of
Algorithm 3.7.3 requires no actions. □


EXAMPLE 3.7.6. Let us consider the following grammar with axiom S:

    S → SA | A | a
    A → aA | Aab | ε

It is not necessary to take away the axiom S from the right hand side of the
productions. We eliminate the ε-production A → ε and, after the elimination of the
trivial unit production S → S, we get:

    S → SA | A | a | ε
    A → aA | a | Aab | ab

We consider the production S → ε aside and we eliminate the unit production
S → A. We get the following productions:

    S → SA | aA | a | Aab | ab
    A → aA | a | Aab | ab

We have the following dependency graph:

    [Figure: dependency graph with nodes S and A, and arcs S → S, S → A,
     and A → A.]

We first break the self-loop of A due to the production A → Aab. By applying
Algorithm 3.5.16 (Elimination of Left Recursion) on page 129, we replace the
productions for A by the following productions:

    A → a | ab | aA | aZ | abZ | aAZ
    Z → ab | abZ

where Z is a new nonterminal symbol. We then break the self-loop of S due to the
production S → SA. By applying again Algorithm 3.5.16 we replace the productions
for S by the following productions:

    S → a | aA | Aab | ab | aY | aAY | AabY | abY     (α)
    Y → A | AY                                        (β)

where Y is a new nonterminal symbol. Now we apply Step (4) of Algorithm 3.7.3.
We have to unfold A in the leftmost positions of the productions (α) and (β). We
leave these unfolding steps to the reader. We leave to the reader also Step (5). After
these steps one gets the desired Greibach normal form.

Note that the language L generated by the given grammar is a regular language.
Indeed, by the Arden rule, the language generated by the nonterminal A (see
Definition 1.2.4 on page 11) is a*(ab)*, and L, that is, the language generated by the
nonterminal S, is ε + (a + a*(ab)*) (a*(ab)*)* which is equivalent to (a⁺b)*a*. The
minimal finite automaton which accepts L can be derived by using the techniques of
Sections 2.5 and 2.8 and it is depicted in Figure 3.7.1. The corresponding grammar
in Greibach normal form, obtained as the right linear grammar corresponding to
that finite automaton, is:

    S → aA | a | ε
    A → aA | a | bS | b


FIGURE 3.7.1. The minimal finite automaton corresponding to the
grammar of Example 3.7.6 on page 137 with axiom S and the following
productions: S → SA | A | a, A → aA | Aab | ε. The language
generated by this grammar and accepted by this automaton is
(a⁺b)*a*.


EXERCISE 3.7.7. Let us consider the grammar (see Example 3.6.4 on page 132)
with the following productions and axiom E:

    E → E + T | T
    T → T × F | F
    F → (E) | a

As indicated in Example 3.6.4 on page 132, after the elimination of the unit
productions T → F and E → T (in this order), we get:

    E → E + T | T × F | (E) | a
    T → T × F | (E) | a
    F → (E) | a

We have the following dependency graph:

    [Figure: dependency graph with arcs E → E, E → T, and T → T.]

We then break the self-loop of E due to E → E + T and the self-loop of T due to
T → T × F, and we get:

    [Figure: the updated dependency graph, with the single arc E → T.]

    E → T × F | (E) | a | T × F Z | (E) Z | a Z
    Z → +T | +TZ
    T → (E) | a | (E) Y | a Y
    Y → ×F | ×FY
    F → (E) | a

Then, (i) by ‘going up from the leaves’, that is, by unfolding T in the productions
E → T × F and E → T × F Z, and (ii) by promoting the two intermediate terminal
symbols ‘)’ and ‘×’, we get the following grammar in Greibach normal form:

    E → (ERMF | aMF | (ERYMF | aYMF | (ER | a
      | (ERMFZ | aMFZ | (ERYMFZ | aYMFZ | (ERZ | aZ
    Z → +T | +TZ
    T → (ER | a | (ERY | aY
    Y → ×F | ×FY
    F → (ER | a
    M → ×
    R → )


EXERCISE 3.7.8. Let us consider again the grammar of Exercise 3.7.7 on page 137
with the following productions and axiom E:

    E → E + T | T
    T → T × F | F
    F → (E) | a

In this exercise we present a new derivation of a grammar in Greibach normal form
equivalent to that grammar. We will apply an algorithm which is proposed in [9,
Section 4.6]. In our case it amounts to performing the following actions. We first
eliminate the left recursive productions for E and T and we get:

    E → T | TZ
    Z → +T | +TZ
    T → F | FY
    Y → ×F | ×FY
    F → (E) | a

Then, (i) we unfold F in the productions for T and then we unfold T in the
productions for E, and (ii) we promote the intermediate terminal symbol ‘)’. We get
the following productions:

    E → (ER | a | (ERY | aY | (ERZ | aZ | (ERYZ | aYZ
    Z → +T | +TZ
    T → (ER | a | (ERY | aY
    Y → ×F | ×FY
    F → (ER | a
    R → )


EXERCISE 3.7.9. Let us consider again the same grammar of Exercise 3.7.7 with
the following productions and axiom E:

    E → E + T | T
    T → T × F | F
    F → (E) | a

We can get an equivalent grammar in Greibach normal form by first transforming
the left recursive productions into right recursive productions, that is, transforming
every production of the form: A → Aα into a production of the form: A → βA,
where A ∈ V_N, α, β ∈ (V_T ∪ V_N)* and A does not occur in α and β.

Here is the resulting right recursive grammar, which is equivalent to the given
grammar:

    E → T + E | T
    T → F × T | F
    F → (E) | a

The correctness proof of this transformation derives from the fact that, given the
productions E → E + T | T, the nonterminal symbol E generates the regular
language L(E) = T(+T)* which is equal to (T+)*T (we leave it to the reader to do
the easy proof by induction on the length of the generated word), and thus, L(E)
can be generated also by the two productions:

    E → T + E | T

None of these productions is left recursive. An analogous argument can be applied to
the two productions T → T × F | F and we get the new productions:

    T → F × T | F

Then, (i) we unfold F in the productions for T, (ii) we unfold T in the productions
for E, and (iii) we promote the intermediate terminal symbols ‘)’, ‘×’, and ‘+’. We
get the following grammar in Greibach normal form:

    E → aMT | a | (ERMT | (ER | aMTPE | aPE | (ERMTPE | (ERPE
    T → aMT | a | (ERMT | (ER
    F → (ER | a
    P → +
    M → ×
    R → )

Note that in this derivation of the Greibach normal form of the given grammar each
terminal symbol is generated by a nonterminal symbol.


It can be shown that every extended context-free grammar has an equivalent 
grammar in Short Greibach normal form and in Double Greibach normal form which 
are defined as follows. 


DEFINITION 3.7.10. [Short Greibach Normal Form] A context-free grammar G
is said to be in Short Greibach normal form if its productions are of the
form:

    A → a       for a ∈ V_T
    A → aB      for a ∈ V_T and B ∈ V_N
    A → aBC     for a ∈ V_T and B, C ∈ V_N

The set of productions of G includes also the production S → ε iff ε ∈ L(G) (see
also Definition 3.7.1 on page 133).


DEFINITION 3.7.11. [Double Greibach Normal Form] A context-free grammar G
is said to be in Double Greibach normal form if its productions are of the
form:

    A → a        for a ∈ V_T
    A → ab       for a, b ∈ V_T
    A → aYb      for a, b ∈ V_T and Y ∈ V_N
    A → aYZb     for a, b ∈ V_T and Y, Z ∈ V_N

The set of productions of G includes also the production S → ε iff ε ∈ L(G) (see
also Definition 3.7.1).


3.8. Theory of Language Equations 


In this section we will present the so-called Theory of Language Equations which
will allow us to present a new algorithm for deriving the Greibach normal form
of a given context-free grammar. By applying this algorithm, which is based on
a generalization of the Arden rule (see Section 2.6 starting on page 56), usually
the number of productions of the derived grammar is smaller than the number
of productions which are generated by applying Algorithm 3.7.3 (see page 133).
However, the number of nonterminal symbols may be larger.


The Theory of Language Equations is parameterized by two alphabets: (i) the
alphabet V_T of the terminal symbols, and (ii) the alphabet V_N of the nonterminal
symbols. As usual, we assume that V_T ∩ V_N = ∅ and we denote by V the set
V_T ∪ V_N. The alphabets V_T and V_N are supposed to be fixed for each instance of
the Theory of Language Equations we will consider.

A language expression over V is an expression α of the form:

    α ::= ∅ | ε | x | α_1 + α_2 | α_1 · α_2

where x ∈ V. Instead of α_1 · α_2, we also write α_1 α_2. The operation + between
language expressions is associative and commutative, while the operation · is
associative, but not commutative (indeed, as we will see, it denotes language
concatenation).

Every language expression over V denotes a language as we now specify.

(i) The language expression ∅ denotes the language {} consisting of no words.

(ii) The language expression ε denotes the language {ε}, where ε is the empty word.

(iii) For each x ∈ V_T, the language expression x denotes the language {x}.

(iv) The operation +, called sum or addition, denotes union of languages.

(v) The operation ·, called multiplication, denotes concatenation of languages (see
Section 1.1).

As usual, the denotation of the language expression x, with x ∈ V_N, is determined
by an interpretation which associates a language, subset of V_T*, with each element
of V_N.


For every language expression α, α_1, and α_2, we have that:

    (i)   α + α = α
    (ii)  α + ∅ = ∅ + α = α
    (iii) α ∅ = ∅ α = ∅
    (iv)  α ε = ε α = α
    (v)   α (α_1 + α_2) = (α α_1) + (α α_2)
    (vi)  (α_1 + α_2) α = (α_1 α) + (α_2 α)

Each of the above equalities (i)-(vi) holds because it holds between the languages
denoted by the language expressions occurring to the left and to the right of the
equality signs ‘=’. Note that, by using the distributivity laws (v) and (vi), every
language expression α, with α ≠ ∅, is equal to the sum of one or more monomial
language expressions, that is, language expressions without addition.

For instance, the language expression a(b + ε) is equal to ab + a, which is the
sum of the two monomial language expressions ab and a.
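The rewriting of a language expression into a sum of monomials can be sketched as
follows. The encoding of language expressions as nested tuples and the function
name monomials are assumptions made here for illustration only. A monomial is
represented as the tuple of its symbols, so that ε is the empty tuple and ∅ is the
empty set of monomials.

```python
def monomials(expr):
    """Return the set of monomials whose sum is equal to `expr`.

    Hypothetical encoding of a language expression:
      ("empty",)        the expression for the empty language
      ("eps",)          the expression for the empty word
      ("sym", x)        a symbol x of V
      ("sum", e1, e2)   e1 + e2
      ("cat", e1, e2)   e1 . e2
    """
    kind = expr[0]
    if kind == "empty":
        return set()
    if kind == "eps":
        return {()}
    if kind == "sym":
        return {(expr[1],)}
    if kind == "sum":
        # laws (i) and (ii): duplicates and the empty language disappear
        return monomials(expr[1]) | monomials(expr[2])
    if kind == "cat":
        # laws (iii)-(vi): concatenation distributes over the sums
        return {u + v for u in monomials(expr[1]) for v in monomials(expr[2])}
    raise ValueError(f"unknown language expression: {expr!r}")
```

With this encoding, the expression a(b + ε) is
("cat", ("sym", "a"), ("sum", ("sym", "b"), ("eps",))) and its monomials are
("a", "b") and ("a",), that is, ab and a.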

A language equation (or an equation, for short) e_A over the pair ⟨V_N, V⟩ is a
construct of the form A = α, where A ∈ V_N and α is a language expression over V
different from A itself.

A system E of language equations over the pair ⟨V_N, V⟩ is a set of language
equations over ⟨V_N, V⟩, one for each nonterminal of V_N.

A solution of a system of language equations over the pair ⟨V_N, V⟩ is a function s
which for each A ∈ V_N, defines a language s(A) ⊆ V_T*, called solution language, such
that if for each A ∈ V_N we consider s(A), instead of A, in every equation of E, and
we consider union of languages and concatenation of languages, instead of + and ·,
respectively, then we get valid equalities between languages. A solution of a system
of language equations over the pair ⟨V_N, V⟩ can also be given by providing for each
A ∈ V_N, a language expression which denotes the language s(A).

Note that given any system of language equations over the pair ⟨V_N, V⟩, we can
define a partial order, denoted ≤, between two solutions s_1 and s_2 of that system as
follows:

    s_1 ≤ s_2 iff for all A ∈ V_N, s_1(A) ⊆ s_2(A).

The following definition establishes a correspondence between the sets of
context-free productions (which may also include ε-productions) and the sets of
systems of language equations.


DEFINITION 3.8.1. [Systems of Language Equations and Context-Free
Productions] With each system E of language equations over ⟨V_N, V⟩, we can
associate a (possibly empty) set P of context-free productions as follows: we start
from P being the empty set and then, for each equation A = α in the given system
E of language equations,

(i) we do not modify P if α = ∅, and

(ii) we add to P the n productions A → α_1 | ... | α_n, if α = α_1 + ... + α_n and the
α_i’s are all monomial language expressions.

Conversely, given any extended context-free grammar G = ⟨V_T, V_N, P, S⟩ we can
associate with G a system E of language equations over the pair ⟨V_N, V_T ∪ V_N⟩
defined as follows: E is the smallest set of language equations containing for each
A ∈ V_N, the equation A = α_1 + ... + α_n, if A → α_1 | ... | α_n are all the productions
in P for the nonterminal A.


DEFINITION 3.8.2. [Systems of Language Equations Represented as
Equations Between Vectors of Language Expressions] Given the terminal alphabet
V_T and the nonterminal alphabet V_N = {A_1, A_2, ..., A_m}, a system E of language
equations over ⟨V_N, V⟩, where V = V_T ∪ V_N, can be represented as an equation
between vectors as follows:

                                         | α_11  α_12  ...  α_1m |
    [A_1 A_2 ... A_m] = [A_1 A_2 ... A_m] | α_21  α_22  ...  α_2m | + [B_1 B_2 ... B_m]
                                         |  ...                  |
                                         | α_m1  α_m2  ...  α_mm |

where: (i) [A_1 A_2 ... A_m] is the vector of the m (≥ 1) nonterminal symbols in V_N,
and (ii) each of the α_ij’s and B_i’s is a language expression over V. A solution of
that system E can be represented as a vector [α_1, α_2, ..., α_m] of language expressions
such that for i = 1, ..., m, we have that α_i denotes the solution language s(A_i).


In the above Definition 3.8.2 matrix addition, denoted by +, and matrix multiplication,
denoted by · or juxtaposition, are defined, as usual, in terms of addition and
multiplication of the elements of the matrices themselves. We have, in fact, that
these elements are language expressions.


The following example illustrates the way in which we can derive the representa- 
tion of a system of language equations as an equation between vectors as indicated 
in Definition 3.8.2 above. 


EXAMPLE 3.8.3. A System of Language Equations Represented as an Equation
between Vectors of Language Expressions. Let us consider the terminal alphabet
V_T = {a, b, c, d}, the nonterminal alphabet V_N = {A, B}, and the context-free
productions:

A → AaB | BB | b
B → aA | BAa | Bd | c

These productions can be represented as the following two language equations over
⟨V_N, V_T ∪ V_N⟩:

A = AaB + BB + b
B = aA + BAa + Bd + c

These two equations can be represented as the following equation between vectors
of language expressions:

[A B] = [A B] | aB  ∅    | + [b  aA+c]
              | B   Aa+d |
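The splitting of the productions of Example 3.8.3 into a matrix and a vector can be sketched in Python as follows. This is only an illustrative sketch: the function names to_vector_equation and show, and the encoding of a monomial as a tuple of one-character symbols, are assumptions made here and are not notation used elsewhere in the book.

```python
def to_vector_equation(productions, nonterminals):
    """Split productions into (R, B) so that A_i = sum_j A_j R[j][i] + B[i].

    Each entry of R and B is a list of monomials; the empty list stands for
    the empty language, and the empty tuple for the empty word.
    """
    index = {nt: k for k, nt in enumerate(nonterminals)}
    m = len(nonterminals)
    R = [[[] for _ in range(m)] for _ in range(m)]
    B = [[] for _ in range(m)]
    for nt in nonterminals:
        i = index[nt]
        for alt in productions[nt]:
            if alt and alt[0] in index:
                # alt = A_j rest: 'rest' is a monomial of R[j][i]
                R[index[alt[0]]][i].append(tuple(alt[1:]))
            else:
                # alt does not start with a nonterminal: monomial of B[i]
                B[i].append(tuple(alt))
    return R, B


def show(expr):
    """Render a list of monomials as a language expression."""
    return " + ".join("".join(mono) or "ε" for mono in expr) or "∅"


# Productions of Example 3.8.3 (hypothetical encoding: one character per symbol).
productions = {
    "A": ["AaB", "BB", "b"],
    "B": ["aA", "BAa", "Bd", "c"],
}
R, B = to_vector_equation(productions, ["A", "B"])
for row in R:
    print([show(entry) for entry in row])
print([show(entry) for entry in B])
```

The two printed rows of R are the rows of the matrix of the example, and the last printed list is the vector [b aA+c]. A unit production A → B would contribute the monomial ε to the corresponding entry of R.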


Given the nonterminal alphabet V_N = {A1, A2, ..., Am}, for simplicity reasons, in
what follows we will write A⃗ instead of [A1 A2 ... Am] when m is understood from
the context. Given an m×m matrix R whose elements are language expressions,

- by R^0 we denote the matrix whose elements are all ∅, with the exception of the
elements of the main diagonal which are all the language expression ε,

- for i ≥ 0, by R^(i+1) we denote R^i · R, where · denotes multiplication of matrices,
and

- by R* we denote Σ_{i≥0} R^i.

We have the following theorems which are the generalizations to n dimensions of
the Arden rule presented in Section 2.6. In stating these theorems we assume that
m (≥ 1) denotes the cardinality of the non-empty nonterminal alphabet V_N.


THEOREM 3.8.4. A system of m (≥ 1) language equations over ⟨V_N, V⟩, represented
as the equation A⃗ = A⃗ R + B⃗, where A⃗ is the m-dimensional vector
[A1 A2 ... Am] and V_N = {A1, A2, ..., Am}, has the minimal solution B⃗ R*.
This minimal solution can also be expressed as the function s such that for each
i = 1,...,m, s(Ai) = ⋃_{j=1,...,m} s(Bj) · s(R*_{ji}), where · denotes concatenation of
languages.


THEOREM 3.8.5. Let us consider the system E of m (≥ 1) language equations
over ⟨V_N, V⟩, represented as A⃗ = A⃗ R + B⃗. Let us also consider: (i) the system F1
of m (≥ 1) language equations over ⟨V_N, V⟩ represented as A⃗ = B⃗ Q + B⃗, where
Q is an m×m matrix of new nonterminal symbols of the form:

    Q11  Q12  ...  Q1m
    Q21  Q22  ...  Q2m
    ...
    Qm1  Qm2  ...  Qmm

and (ii) the system F2 of m² language equations over the pair ⟨{Q11, ..., Qmm},
{Q11, ..., Qmm} ∪ V⟩ represented as the equation Q = R Q + R whose left hand
side and right hand side are m×m matrices. The system of language equations
consisting of the language equations in F1 and in F2, has a minimal solution that,
when restricted to V_N, is equal to B⃗ R* (thus, this minimal solution is equal to the
minimal solution of the system E).


PROOF. It is obtained by generalizing the proof of the Arden rule from one
dimension to n dimensions. Note that the solution of a system of language equations
is unique if we assume that they are associated with context-free productions none
of which is a unit production (see Definition 3.5.10 on page 126) or an ε-production.
This condition generalizes to n dimensions the condition ‘ε ∉ S’ in the case of
the equation X = S X + T, which we stated for the Arden rule (see Section 2.6 on
page 56).


On the basis of the above Theorem 3.8.4 on page 144 and Theorem 3.8.5 on 
page 144, we get the following new algorithm for constructing the Greibach normal 
form of a given context-free grammar G. 


ALGORITHM 3.8.6.
Procedure: from an extended context-free grammar G = ⟨V_T, V_N, P, S⟩ to an
equivalent context-free grammar in Greibach normal form. (Version 2)

Step (1). Simplify the grammar. Transform the given grammar G by:
(i) eliminating ε-productions, with the exception of S → ε iff ε ∈ L(G), and
(ii) eliminating unit productions.
(The elimination of useless symbols or left recursion is not necessary.) Let the
derived grammar G^s be the 4-tuple ⟨V_T, V_N^s, P^s, S⟩. We have that S → ε ∈ P^s iff
ε ∈ L(G).

Step (2). Construct the associated system of language equations represented as an
equation between vectors. We write the system of language equations over ⟨V_N^s,
V_T ∪ V_N^s⟩ associated with the grammar G^s without the production S → ε if it
occurs in P^s. Let that system of language equations be:

A⃗ = A⃗ R + B⃗.

Step (3). Construct two systems of language equations represented as equations
between vectors. We construct the two systems of language equations:

A⃗ = B⃗ Q + B⃗
Q = R Q + R

Step (4). Construct the productions associated with the two systems of language
equations. We derive a context-free grammar H by constructing the productions
associated with the two systems of language equations of Step (3). In this grammar H:
(4.1) for each A ∈ V_N, the right hand side of the productions for A begins with a
terminal symbol in V_T, and (4.2) for each Qij ∈ {Q11, ..., Qmm} the right hand
side of the productions for Qij begins with a symbol in V_T ∪ V_N. By unfolding the
productions of Point (4.2) with respect to the productions of Point (4.1), we make
the right hand sides of all productions begin with a terminal symbol.

Step (5). Promote the intermediate terminal symbols. In every production of the
grammar H: (5.1) replace every terminal symbol f which does not occur at the
leftmost position of the right hand side of a production, by a new nonterminal
symbol F, and (5.2) add the production F → f to H. The resulting grammar,
together with the production S → ε if it occurs in P^s, is a grammar in Greibach
normal form equivalent to the given grammar G.


Note that by applying the above Algorithm 3.8.6, we may generate a grammar with 
useless symbols, as indicated by the following example. 


EXAMPLE 3.8.7. Let us consider the grammar with axiom A and the following
productions:

A → AaB | BB | b
B → aA | BAa | Bd | c

These productions can be represented (see Example 3.8.3 on page 143) as follows:

[A B] = [A B] | aB  ∅    | + [b  aA+c]
              | B   Aa+d |

From this equation we construct the following two equations:

[A B] = [b  aA+c] | Q11  Q12 | + [b  aA+c]
                  | Q21  Q22 |

| Q11  Q12 | = | aB  ∅    | | Q11  Q12 | + | aB  ∅    |
| Q21  Q22 |   | B   Aa+d | | Q21  Q22 |   | B   Aa+d |

From these equations we get the productions:

A   → bQ11 | aAQ21 | cQ21 | b             (α)
B   → bQ12 | aAQ22 | cQ22 | aA | c        (β)
Q11 → aBQ11 | aB
Q12 → aBQ12                               (γ)
Q21 → BQ11 | AaQ21 | dQ21 | B
Q22 → BQ12 | AaQ22 | dQ22 | Aa | d

We leave it to the reader to complete the construction of the Greibach normal form
by: (i) unfolding the nonterminals A and B occurring in the leftmost positions of
the right hand sides of the above productions by using the productions (α) and (β),
and (ii) replacing the terminal symbol a occurring in a non-leftmost position on
the right hand side of some productions by the new nonterminal A_a, and adding the
new production A_a → a.

As the reader may verify, the symbol Q12 is useless and thus, the productions (γ),
B → bQ12, and Q22 → BQ12 can be discarded. □
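The passage from the matrix R and the vector B⃗ of Example 3.8.7 to the productions associated with A⃗ = B⃗ Q + B⃗ and Q = R Q + R can be sketched as follows. The function name greibach_systems, the names Q11, Q12, ... generated for the new nonterminals, and the encoding of monomials as tuples of symbols are assumptions of this sketch, which covers only Steps (3) and (4) up to, and not including, the unfolding.

```python
def greibach_systems(R, B, nonterminals):
    """Productions associated with A = B Q + B and Q = R Q + R.

    R[i][k] and B[j] are lists of monomials (tuples of symbols).
    The new nonterminals are named Q11, Q12, ... (a naming assumption that
    is unambiguous only for fewer than ten nonterminals).
    """
    m = len(nonterminals)

    def q(i, j):
        return f"Q{i + 1}{j + 1}"

    prods = {}
    for i, nt in enumerate(nonterminals):
        # i-th component of B Q + B: sum over j of B[j] Q_ji, plus B[i]
        with_q = [mono + (q(j, i),) for j in range(m) for mono in B[j]]
        prods[nt] = with_q + list(B[i])
    for i in range(m):
        for j in range(m):
            # (i, j) entry of R Q + R: sum over k of R[i][k] Q_kj, plus R[i][j]
            with_q = [mono + (q(k, j),) for k in range(m) for mono in R[i][k]]
            prods[q(i, j)] = with_q + list(R[i][j])
    return prods


# Matrix R and vector B of Example 3.8.7; [] stands for the empty language.
R = [[[("a", "B")], []],
     [[("B",)], [("A", "a"), ("d",)]]]
B = [[("b",)], [("a", "A"), ("c",)]]

prods = greibach_systems(R, B, ["A", "B"])
for lhs, alts in prods.items():
    print(lhs, "→", " | ".join("".join(alt) for alt in alts))
```

The printed productions are the six groups of productions listed in the example. In particular, the only production for Q12 is Q12 → aBQ12, which shows why Q12 cannot derive any terminal word.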


Note that in order to compute the language generated by a context-free grammar 
using the Arden rule, it is not required that the solution be unique. It is enough that 
the solution be minimal. For instance, if we consider the grammar G with axiom S 
and productions: 

S → b | AS | A        A → a | ε

we get the language equations:

S = b + AS + A        A = a + ε

The Arden rule gives us the solution: S = A*(b + A) with A = a + ε. Thus,
S = (a + ε)*(b + a + ε), that is, S = a*b + a*. This solution for S denotes the
language generated by the given grammar G and it is not a unique solution because
ε ∈ A. A non-minimal solution for S is the language a*b + a* + a*bb, which is not
generated by the grammar G. 
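The difference between the minimal solution and a non-minimal one can be checked on words of bounded length. The following Python sketch computes the least fixed point of the language equations restricted to words of length at most N, and compares it with the two regular expressions above. The names bounded_language, words_matching, and rhs, and the bound N = 6, are assumptions of this sketch; a bounded check of this kind is a sanity test, not a proof.

```python
import re
from itertools import product


def bounded_language(productions, max_len):
    """Least fixed point of the language equations, keeping only the
    words of length <= max_len. Symbols are single characters."""
    lang = {nt: set() for nt in productions}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in productions.items():
            for alt in alternatives:
                words = {""}
                for sym in alt:
                    choices = lang[sym] if sym in productions else {sym}
                    words = {u + v for u in words for v in choices
                             if len(u) + len(v) <= max_len}
                if not words <= lang[nt]:
                    lang[nt] |= words
                    changed = True
    return lang


N = 6
G = {"S": ["b", "AS", "A"], "A": ["a", ""]}   # "" encodes the empty word
minimal = bounded_language(G, N)["S"]


def words_matching(pattern):
    """All words over {a, b} of length <= N matching the regular expression."""
    return {"".join(w) for k in range(N + 1) for w in product("ab", repeat=k)
            if re.fullmatch(pattern, "".join(w))}


def rhs(S):
    """Right hand side b + A S + A with A = a + ε, truncated to length <= N."""
    A = {"a", ""}
    return {"b"} | {u + v for u in A for v in S if len(u + v) <= N} | A


larger = words_matching(r"a*b|a*|a*bb")
print(minimal == words_matching(r"a*b|a*"))
print(rhs(minimal) == minimal, rhs(larger) == larger, minimal < larger)
```

Both sets are fixed points of the equation for S, but only the smaller one is the language generated by the grammar: for instance, the word bb belongs to the larger solution and has no derivation in G.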


3.9. Summary on the Transformations of Context-Free Grammars 


In this section we present a sequence of steps for simplifying and transforming ex- 
tended context-free grammars. During these steps we use various procedures which 
have been introduced in Sections 3.5, 3.6, and 3.7. 

Let us consider an extended context-free grammar G = ⟨V_T, V_N, P, S⟩ which we
want to simplify and transform. We perform the following four steps.

Step (1). We first apply the From-Below Procedure (see Algorithm 3.5.1 on page 123)
for eliminating the symbols which do not produce words in V_T*, and then the
From-Above Procedure (see Algorithm 3.5.3 on page 124) for eliminating the symbols which
do not occur in any sentential form γ such that S →* γ.

Step (2). We eliminate the ε-productions and derive a grammar which may include
the production S → ε, and no other ε-productions (see Algorithm 3.5.8 on page 126).
After this step useless symbols may be generated and we may want to apply again
the From-Below Procedure.

Step (3). We leave aside the production S → ε, if it has been derived during the
previous Step (2), and we eliminate the unit productions from the remaining
productions (see Algorithm 3.5.11 on page 127). After the elimination of the unit
productions, useless symbols may be generated and we may want to apply again the
From-Above Procedure.

Step (4). We produce the Chomsky normal form (see Algorithm 3.6.3 on page 131)
or the Greibach normal form (see Algorithm 3.7.3 on page 133). In order to do so we
start from a grammar without unit productions and without ε-productions, leaving
aside the production S → ε if it occurs in the set of productions of the grammar
derived after Step (2). During this step we may need to apply the procedure for
eliminating left recursive productions (see Algorithm 3.5.16 on page 129).

Recall that during the elimination of the left recursive productions, unit pro- 
ductions may be generated and we may want to eliminate them by using Algo- 
rithm 3.5.11 on page 127. Note, however, that in this Step (4), after the elimination 
of unit productions, no subsequent generation of useless symbols is possible. 

In any of the above four steps we do not need the axiom S of the grammar to 
occur only on the left hand side of productions, although it is always possible to get 
an equivalent grammar which satisfies that condition. 


3.10. Self-Embedding Property of Context-Free Grammars 


DEFINITION 3.10.1. [Self-Embedding Context-Free Grammars] We say
that an S-extended context-free grammar G = ⟨V_T, V_N, P, S⟩ is self-embedding
iff there exists a nonterminal symbol A such that:

(i) A →* α A β with α ≠ ε and β ≠ ε, that is, α, β ∈ (V_T ∪ V_N)^+, and

(ii) A is a useful symbol, that is, S →*_G α A β →*_G w for α, β ∈ (V_T ∪ V_N)* and
w ∈ V_T*. In that case also the nonterminal symbol A is said to be self-embedding.

Here are the productions of a self-embedding context-free grammar which generates
the regular language {a^n | n ≥ 1}: S → a | aa | aSa


THEOREM 3.10.2. [Context-Free Grammars That Are Not Self-Embed- 
ding] If G is an S-extended context-free grammar which is not self-embedding then 
L(G) is a regular language. 


PROOF. Without loss of generality, we may assume that the grammar G has no
unit productions, no ε-productions, no useless symbols, and the axiom S does not
occur on the right hand side of any production. If the production S → ε exists in
the grammar G, it may only contribute to the word ε, and thus, its presence is not
significant for this proof. The proof consists of the following two points.


Point (1). We first prove that given any context-free grammar which is not self- 
embedding, its Greibach normal form is not self-embedding either. This comes from 
the following two facts: 

(1.1) the elimination of left recursion does not introduce self-embedding, and

(1.2) the application of the rewritings of Step (1) through Step (5) when producing
the Greibach normal form (see Algorithm 3.7.3 on page 133), does not introduce
self-embedding.

Proof of (1.1). Let us consider, without loss of generality, the transformation from
the old productions:

A → Aβ | γ

to the new productions:

A → γ | γT        T → β | βT

We will show that:

(1.1.1) if A is self-embedding in the new grammar then A is self-embedding in the
old grammar, and

(1.1.2) if T is self-embedding in the new grammar then T is self-embedding in the
old grammar.

Proof of (1.1.1). If A is self-embedding in the new grammar we have that: either
γ →* uAv, with u ≠ ε and v ≠ ε, in which case A is self-embedding in the old
grammar, or T →* uAv, with u ≠ ε and v ≠ ε, in which case β →* uAv in the new
grammar and this implies that A is self-embedding in the old grammar.

Proof of (1.1.2). If T is self-embedding in the new grammar we have that: β →* uTv,
with u ≠ ε and v ≠ ε. But this is impossible because T is a fresh, new nonterminal
symbol.

Proof of (1.2). The rewritings of Step (1) through Step (5) when producing the 
Greibach normal form do not introduce self-embedding because they correspond to 
possible derivations in the grammar we have before the rewritings, and we know 
that that grammar is not self-embedding. This completes the proof of Point (1). 


Point (2). Now we prove that for every context-free grammar H in Greibach normal
form which is not self-embedding, there exists a constant k such that for all u if
S →*_lm u then u has at most k nonterminal symbols.

Proof of (2). Let us consider a context-free grammar H. Let V_N be the set of
nonterminal symbols of H and let V_T be the set of terminal symbols of H. The
productions of H are of one of the following three forms:

(a) A → a        (b) A → aB        (c) A → aσ

where A, B ∈ V_N, a ∈ V_T, σ ∈ V_N*, and |σ| ≥ 2.
Suppose also that in the productions of H, |σ| is at most m. Suppose also that
|V_N| = h ≥ 1. We have that every sentential form obtained by a leftmost derivation
has at most h·m nonterminal symbols. This can be proved by absurdum.

Indeed, if a sentential form, say φ, has more than h·m nonterminal symbols
then the number of the leftmost derivation steps using productions of the form (c),
when producing φ from S, is at least ⌈(h·m)/(m−1)⌉, because at most m−1
nonterminal symbols are added to the sentential form in each leftmost derivation
step which uses a production of the form (c). Since h ≥ 1 and m ≥ 2, we have that
⌈(h·m)/(m−1)⌉ ≥ h+1, and since h is the number of nonterminal symbols in the
grammar H, we also have that the leftmost derivation S →*_lm φ is such that there
exists a nonterminal symbol A ∈ V_N such that S →*_lm uAy →*_lm vAz →*_lm φ where
u, v ∈ V_T* and y, z ∈ V_N*. In other words, (i) at least one nonterminal symbol, say A,
occurs twice in a path of the parse tree of φ from the root S to a leaf, and (ii) that
path is constructed by applying more than h times a production of the form (c).
Thus, the grammar H is self-embedding.

FIGURE 3.10.1. Parse tree of the word φ = abbcabCABCABAC constructed
by leftmost derivations. The grammar is assumed to be in
Greibach normal form. A production of type (a) is of the form: A → a,
a production of type (b) is of the form: A → aB, and a production
of type (c) is of the form: A → aσ, with |σ| ≥ 2. In this picture |σ|
is always 3.

To see this the reader may also consider Figure 3.10.1, where we have depicted a
parse tree from the axiom S to the sentential form φ = abbcabCABCABAC. If the path
from S down to the lowest occurrence of C is due to more than h derivation steps
of the form (c) then there must be a nonterminal symbol which occurs twice in the
labels of the black nodes. This means that the grammar is self-embedding.

This completes the proof that every sentential form obtained by a leftmost derivation
has at most h·m nonterminal symbols. Now we can conclude the proof of the
theorem as follows.

We recall that in the construction of a finite automaton corresponding to a
regular grammar the production A → aB corresponds to an edge from a state
A to a state B labeled by a. Thus, we can encode each tuple of at most k nonterminal
symbols which occurs in a sentential form of a leftmost derivation of a grammar H
in Greibach normal form which is not self-embedding, into a distinct state and we
can derive a finite automaton corresponding to H. This shows that L(H) is a regular
language.


REMARK 3.10.3. In the above proof the condition that the derivation should be
a leftmost derivation is necessary. Indeed, let us consider the grammar G whose
productions are:

S → aAS | a        A → a

It is not self-embedding because: (i) S is not self-embedding and (ii) A is not
self-embedding. However, the following derivation which is not a leftmost derivation
(indeed, it is a rightmost one), produces a sentential form with n nonterminal
symbols, for any n ≥ 1:

S → aAS → aAaAS → ... → (aA)^n S. □

THEOREM 3.10.4. A context-free language (possibly including ε) is regular iff
it can be generated by an S-extended context-free grammar which is not
self-embedding.

PROOF. (if part) See Theorem 3.10.2. (only if part) No regular grammar is
self-embedding because nonterminal symbols, if any, are only on the rightmost positions
of any sentential form. □


3.11. Pumping Lemma for Context-Free Languages 


The following theorem has been proved in [4]. It is also called the Pumping Lemma
for context-free languages. Recall that for every grammar G, by L(G) we denote the
language generated by G. This lemma provides a necessary condition for a
language to be context-free.


THEOREM 3.11.1. [Bar-Hillel Theorem. Pumping Lemma for Context-Free
Languages] For every context-free grammar G there exists n > 0, called a
pumping length of the grammar G, depending on G only, such that for all z ∈ L(G),
if |z| ≥ n then there exist the words u, v, w, x, y such that:

(i) z = uvwxy,
(ii) vx ≠ ε,
(iii) |vwx| ≤ n, and
(iv) for all i ≥ 0, u v^i w x^i y ∈ L(G).
The minimum value of the pumping length n is said to be the minimum pumping
length of the grammar G.


PROOF. Let L denote the language L(G). Consider the grammar G_C in Chomsky
normal form which generates L − {ε}. Thus, in particular, the production S → ε
does not belong to G_C. We first prove by induction the following property where
we assume that the length of a path n1 - n2 - ... - nm on a parse tree from node
n1 to node nm, is m−1.

Property (A): for any i ≥ 1, if a word x ∈ L has a parse tree according to the
grammar G_C with its longest path of length i then |x| ≤ 2^(i−1).

(Basis) For i = 1 the length of x is 1 because the parse tree of x is the one with
root S and a unique son-node x (recall that every production in a grammar in
Chomsky normal form whose right hand side has terminal symbols only, is of the
form A → a).

(Step) We assume Property (A) for i = h ≥ 1. We will show it for i = h+1. If the
length of the longest path of the parse tree of x is h+1 then the root S of the parse
tree of x has two son-nodes which are the roots of two subtrees, say t1 and t2, each
of which has its longest path whose length is no greater than h. By induction, the
yield of t1 is a word whose length is not greater than 2^(h−1). Likewise the yield of t2
is a word whose length is not greater than 2^(h−1). Thus, the length of x is not greater
than 2^h. This concludes the proof of Property (A).

FIGURE 3.11.1. The parse tree of the word z1 z2 z3 z4 z5. The grammar
has no ε-productions and it is in Chomsky normal form with k
nonterminal symbols. All the nonterminal symbols on the path from
the upper A to the leaf b are distinct, except for the two A's. That
path A - ... - A - ... - b includes at most k+2 nodes and, thus,
its length is at most k+1.

Now let k be the number of nonterminal symbols in the grammar G_C. Let us
consider a word z such that |z| ≥ 2^k. By Property (A) in any parse tree of z
there is a path, say p, of length greater than k. Thus, since in G_C there are k
nonterminal symbols, in the path p there is at least a nonterminal symbol which
appears twice. Let us consider the two nodes, say n1 and n2, of the path p with
the same nonterminal symbol, say A, such that the node n1 is an ancestor of the
node n2 and the nonterminal symbols in the nodes below n1 are all distinct (see
Figure 3.11.1).

Now,
- at node n1 we have that A →* z2 A z4, and
- at node n2 we have that A →* z3.

We also have that the length of the path from n1 to (and including) a leaf of the
subtree rooted in n2 is at most k+1 because the nonterminal symbols in that path
are all distinct. Thus, by Property (A), |z2 z3 z4| ≤ 2^k. The value n whose existence
is stipulated by the lemma is 2^k and it depends on the grammar G_C only, because
k is the number of nonterminal symbols in G_C. The fact that |z2 z3 z4| ≤ 2^k shows
Point (iii) of the lemma.

We also have that A →* z2^i A z4^i for any i ≥ 0 because we can replace the
occurrence of A on the right hand side of A →* z2 A z4 by z2 A z4 as many times as desired.
This shows Point (iv) of this theorem.

The yield z of the given parse tree can be written as u z2 z3 z4 y for some words u
and y.

Since in the grammar G_C in Chomsky normal form there are no unit productions,
we cannot have A →+ A and thus, we have that |z2 z4| > 0. This shows Point (ii) of
the lemma and the proof is completed. □


COROLLARY 3.11.2. The language L = {a^i b^i c^i | i ≥ 1} is not a context-free
language, and the language L = {a^i b^i c^i | i ≥ 0} cannot be generated by an S-extended
context-free grammar.

PROOF. Suppose that L is a context-free language and let G be a context-free
grammar which generates L. Let us apply the Pumping Lemma (see Theorem 3.11.1
on page 150) to a word uvwxy = a^n b^n c^n where n is the number whose existence is
stipulated by the lemma. Then uv²wx²y is in L(G).

Case (1). Let us consider the case when v ≠ ε. The word v cannot be across the
a-b boundary because otherwise in uv²wx²y there will be b's to the left of some a's.
Likewise v cannot be across the b-c boundary. Thus, v lies entirely within a^n or b^n
or c^n. For the same reason x lies entirely within a^n or b^n or c^n.

Assume that v is within a^n. In uv²wx²y the number of a's is n + |v|. Since x lies
entirely within a^n or b^n or c^n it is impossible to have in uv²wx²y the number of b's
equal to n + |v| and also the number of c's equal to n + |v|, because x should lie at
the same time within the b's and the c's without lying across any boundary.
Thus, uv²wx²y is not in L(G).

Case (2). Let us consider the case when v = ε. The word x is different from ε. x lies
within the a's or b's or c's because it cannot lie across any boundary (in that case,
in fact, in x² there will be a b to the left of an a). Let us assume that x lies within
the a's. The number of a's in uv²wx²y = uwx²y is n + |x|, while the number of b's
and c's is n. Thus, uv²wx²y is not in L(G). Likewise, one can show that x cannot
lie within the b's or within the c's, and the proof of the corollary is completed. □
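The case analysis of the above proof can be checked exhaustively for small values of n. The following Python sketch enumerates every way of writing z = a^n b^n c^n as uvwxy with vx ≠ ε and |vwx| ≤ n, and looks for a decomposition which survives pumping with i = 0 and i = 2. The function names in_L and pumpable_decompositions and the range of the values of n are assumptions of this sketch; the check only illustrates the corollary for the candidate pumping lengths tried and does not replace the proof.

```python
import re


def in_L(word):
    """Membership in L = { a^i b^i c^i | i >= 1 }."""
    m = re.fullmatch(r"(a+)(b+)(c+)", word)
    return bool(m) and len(m.group(1)) == len(m.group(2)) == len(m.group(3))


def pumpable_decompositions(z, n):
    """Yield every (u, v, w, x, y) with z = uvwxy, vx non-empty and
    |vwx| <= n such that u v^i w x^i y is in L for i = 0 and i = 2."""
    for start in range(len(z) + 1):
        for end in range(start, min(start + n, len(z)) + 1):
            for p in range(start, end + 1):
                for q in range(p, end + 1):
                    u, v, w = z[:start], z[start:p], z[p:q]
                    x, y = z[q:end], z[end:]
                    if v + x and all(in_L(u + v * i + w + x * i + y)
                                     for i in (0, 2)):
                        yield (u, v, w, x, y)


for n in range(1, 6):
    z = "a" * n + "b" * n + "c" * n
    print(n, list(pumpable_decompositions(z, n)))
```

For each n tried, the list of surviving decompositions is empty, in accordance with the corollary: whatever the choice of v and x, at most two of the three blocks of z can change their length.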


We have that also the following languages are not context-free:
L1 = {a^i b^i c^j | 1 ≤ i ≤ j},
L2 = {a^i b^j c^k | 1 ≤ i ≤ j ≤ k},
L3 = {a^i b^j c^k | i ≠ j and j ≠ k and i ≠ k and 1 ≤ i, j, k},
L4 = {a^i b^j c^i d^j | 1 ≤ i, j},
L5 = {a^i b^j a^i b^j | 1 ≤ i, j},
L6 = {a^i b^j c^k d^l | i = 0 or 1 ≤ j = k = l}.
Let us consider the alphabet Σ with at least two symbols distinct from c. We have
that the following languages are not context-free:

L7 = {w c w | w ∈ Σ*},
L8 = {w w | w ∈ Σ^+}.

The above results concerning the languages L1 through L6 can be extended to the
case where the bound ‘1 ≤ ...’ is replaced by the new bound ‘0 ≤ ...’ in the sense
that the languages with the new bounds cannot be generated by any S-extended
context-free grammar. Also the language {w w | w ∈ Σ*} cannot be generated by
any S-extended context-free grammar.


Notice, however, that the following languages are context-free:
L9 = {a^i b^i | 1 ≤ i},
L10 = {a^i b^j c^k | (i ≠ j or j ≠ k) and 1 ≤ i, j, k},
L11 = {a^i b^i c^j d^j | 1 ≤ i, j},
L12 = {a^i b^j c^j d^i | 1 ≤ i, j},
L13 = {a^i b^j c^k | (i = j or j = k) and 1 ≤ i, j, k} where ‘or’ is the ‘inclusive or’.


In particular, the following grammar G13:

S  → S1 C | A S2
S1 → a S1 b | ab
S2 → b S2 c | bc
A  → aA | a
C  → cC | c

generates the language L13.
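The claim that G13 generates L13 can be tested on all words up to a given length. The following Python sketch computes the words of length at most N generated by G13 as the least fixed point of the associated language equations, and compares them with the words of L13 of the same lengths. The function name bounded_language, the encoding of right hand sides as lists of symbols, and the bound N = 9 are assumptions of this sketch.

```python
def bounded_language(productions, max_len):
    """Words of length <= max_len generated by each nonterminal, computed
    as the least fixed point of the associated language equations."""
    lang = {nt: set() for nt in productions}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in productions.items():
            for alt in alternatives:
                words = {""}
                for sym in alt:
                    choices = lang[sym] if sym in productions else {sym}
                    words = {u + v for u in words for v in choices
                             if len(u) + len(v) <= max_len}
                if not words <= lang[nt]:
                    lang[nt] |= words
                    changed = True
    return lang


G13 = {
    "S":  [["S1", "C"], ["A", "S2"]],
    "S1": [["a", "S1", "b"], ["a", "b"]],
    "S2": [["b", "S2", "c"], ["b", "c"]],
    "A":  [["a", "A"], ["a"]],
    "C":  [["c", "C"], ["c"]],
}

N = 9
generated = bounded_language(G13, N)["S"]
L13 = {"a" * i + "b" * j + "c" * k
       for i in range(1, N) for j in range(1, N) for k in range(1, N)
       if (i == j or j == k) and i + j + k <= N}
print(generated == L13)
```

The production S → S1 C accounts for the words with i = j and the production S → A S2 for the words with j = k. A word a^i b^i c^i is obtained in both ways.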

The above results concerning the languages L9 through L13 can be extended to
the case where the bound ‘1 ≤ ...’ is replaced by the new bound ‘0 ≤ ...’ in the
sense that the languages with the new bounds can be generated by an S-extended
context-free grammar.

Let us consider the alphabet Σ with at least two symbols distinct from c. Let
w^R denote the word w with its symbols in the reverse order (see Definition 2.12.3
on page 95). We have that the following languages are context-free:

L14 = {w c w^R | w ∈ Σ*},
L15 = {w w^R | w ∈ Σ^+}.

The language {w w^R | w ∈ Σ*} can be generated by an S-extended context-free
grammar and, in particular, the grammar:

S → ε | aSa | bSb

generates the language {w w^R | w ∈ {a, b}*}.

EXERCISE 3.11.3. Show that the languages {0^(2^n) | n ≥ 1}, {0^(n^2) | n ≥ 1}, and
{0^p | p ≥ 2 and prime(p)}, where prime(p) holds iff p is a prime number, are not
context-free languages.


Hint. Use the Pumping Lemma for context-free languages. Alternatively, use 
Theorem 7.8.1 on page 232 and show that these languages are not regular because 
they do not satisfy the Pumping Lemma for regular languages (see Theorem 2.9.1 
on page 72). 


We have the following fact. 


FACT 3.11.4. [The Pumping Lemma for Context-Free Languages is not
a Sufficient Condition] The Pumping Lemma for context-free languages is a
necessary, but not a sufficient condition for a language to be context-free. Thus, there
are languages which satisfy this Pumping Lemma and are not context-free.

PROOF. Let us first consider the following languages L_ℓ and L_r, where prime(p)
holds iff p is a prime number:

L_ℓ = {a^n b c | n ≥ 0}
L_r = {a^p b a^n c a^n | p ≥ 2 and prime(p) and n ≥ 0}

Let L be the language L_ℓ ∪ L_r. First, in Point (i) we will prove that L is not
context-free, and then in Point (ii) we will show that L satisfies the Pumping Lemma for
the context-free languages.

Point (i). Assume by absurdum that L is context-free. We have that L_r =
L ∩ (Σ* − a* b c) is context-free because regular languages are closed under
complement (see Theorem 2.12.2 on page 94) and context-free languages are closed under
intersection with regular languages (see Theorem 3.13.4 on page 158).

Now the class of context-free languages is a full AFL and it is closed under GSM 
mapping (see Table 4 on page 227 and Table 5 on page 229). 

Let us consider the following generalized sequential machine M which realizes the
GSM mapping:

[Figure: state diagram of the generalized sequential machine M, with edge labels
including a/a, a/ε, and c/ε.]

Thus, the language M(L_r) is {a^p | p ≥ 2 and prime(p)} and it is context-free. By
Theorem 7.8.1 on page 232 M(L_r) is a regular language. Now we get a contradiction
by showing that M(L_r) is not regular because it does not satisfy the Pumping
Lemma for the regular languages. Indeed, for any d ≥ 1 we have that there exist
p ≥ 2 and k ≥ 0 such that p + k d is not prime (if we take k = p we get that: p + k d =
p + p d = p (1+d), and thus, p + k d is not prime).

Point (ii). Now we prove that L satisfies the Pumping Lemma for the context-free
languages. If the word w which is sufficiently long (that is, |w| ≥ n, where
n is the constant whose existence is stated by the Pumping Lemma) belongs to
a* b c then the Pumping Lemma holds by placing the four divisions of w within the
subword a*. Otherwise, if w ∈ L_r, then there are two cases:

Case (ii.1) w = a^p b a^n c a^n and n = 0, and

Case (ii.2) w = a^p b a^n c a^n and n > 0.

Case (ii.1) is similar to the case where w is in a* b c.
In Case (ii.2) the four divisions of w can be taken as follows: a^p b | a^n | c | a^n |. (Note
that if n = 1 then the word a^p b c ∈ a* b c.) This completes the proof of Point (ii). □


REMARK 3.11.5. As is clear from the proof of Point (i), in order to get a
language L which satisfies the Pumping Lemma for the context-free languages and
is not context-free, instead of the predicate π(p) =def p ≥ 2 and prime(p), we may
use any other definition of the predicate π(p) such that {a^p | π(p)} is not a regular
language.


3.12. Ambiguity and Inherent Ambiguity 


DEFINITION 3.12.1. [Ambiguous and Unambiguous Context-Free Grammar]
A context-free grammar such that there exists a word w with at least two
distinct parse trees is said to be ambiguous. A context-free grammar which is not
ambiguous is said to be unambiguous.

We get an equivalent definition if in the above definition we replace ‘two parse trees’
by ‘two leftmost derivations’ or ‘two rightmost derivations’. This is due to the fact
that there is a bijection between the parse trees and the leftmost (or rightmost)
derivations of the words which are their yield.

The grammar with the following productions is ambiguous:

S  → A1 | A2
A1 → a
A2 → a

Indeed, we have these two parse trees for the word a:

  S        S
  |        |
  A1       A2
  |        |
  a        a


Let us consider the grammar G which generates the language
L(G) = {w | w has an equal number of a's and b's and |w| ≥ 1}

and whose productions are:

S → bA | aB
A → a | aS | bAA
B → b | bS | aBB

The grammar G is an ambiguous grammar. Indeed, for the word aabbab ∈ L(G)
there are the two parse trees depicted in Figure 3.12.1.

A grammar G may be ambiguous in two different ways: either (i) there exists
a word in L(G) with two derivation trees which are different without taking into
consideration the labels in their nodes, or (ii) there exists a word in L(G) with
two derivation trees which are different only if we take into consideration the
symbols in their nodes.

We have Case (i) for the grammar with productions:

S → aA | aa
A → a

and for the word aa (see the trees U_a and U_b in Figure 3.12.2 on page 156).
We have Case (ii) for the grammar with productions:

S → aS | a | aA
A → a

and for the word aa (see the trees V_a and V_b in Figure 3.12.2).


      S                       S
     / \                     / \
    a   B                   a   B
       /|\                    / | \
      a B B                  a  B   B
        | |\                   / \   \
        b b S                 b   S   b
           / \                   / \
          a   B                 b   A
              |                     |
              b                     a

FIGURE 3.12.1. Two parse trees of the word aabbab.
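The number of parse trees of a word can be computed by a simple recursive procedure when the grammar has no ε-productions and no cycles of unit productions. The following Python sketch does this for the grammar G with productions S → bA | aB, A → a | aS | bAA, B → b | bS | aBB and the word aabbab, and for the two grammars used above for Case (i) and Case (ii). The function name count_parse_trees and the encoding of the symbols as single characters are assumptions of this sketch.

```python
from functools import lru_cache


def count_parse_trees(productions, start, word):
    """Number of parse trees of `word` with root `start`.

    Assumes no epsilon-productions and no cycles of unit productions,
    so that every symbol of a right hand side covers at least one letter.
    """
    @lru_cache(maxsize=None)
    def trees(symbol, i, j):
        # Number of parse trees with root `symbol` and yield word[i:j].
        if symbol not in productions:          # terminal symbol
            return 1 if j - i == 1 and word[i] == symbol else 0
        return sum(split(tuple(alt), i, j) for alt in productions[symbol])

    @lru_cache(maxsize=None)
    def split(alt, i, j):
        # Number of ways in which the symbols of `alt` derive word[i:j].
        if not alt:
            return 1 if i == j else 0
        first, rest = alt[0], alt[1:]
        # `first` covers word[i:k]; each remaining symbol needs a letter.
        return sum(trees(first, i, k) * split(rest, k, j)
                   for k in range(i + 1, j - len(rest) + 1))

    return trees(start, 0, len(word))


G = {"S": ["bA", "aB"], "A": ["a", "aS", "bAA"], "B": ["b", "bS", "aBB"]}
case_i = {"S": ["aA", "aa"], "A": ["a"]}
case_ii = {"S": ["aS", "a", "aA"], "A": ["a"]}

print(count_parse_trees(G, "S", "aabbab"))
print(count_parse_trees(case_i, "S", "aa"), count_parse_trees(case_ii, "S", "aa"))
```

For the word aabbab the procedure finds exactly the two parse trees of Figure 3.12.1, and for each of the other two grammars it finds two parse trees of the word aa. A result greater than 1 for some word shows that the grammar is ambiguous, while a result equal to 1 for the words tried does not show that it is unambiguous.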


Tree U_a      Tree U_b      Tree V_a      Tree V_b
    S             S             S             S
   / \           / \           / \           / \
  a   a         a   A         a   S         a   A
                    |             |             |
                    a             a             a

FIGURE 3.12.2. Two pairs of derivation trees for the word aa.

DEFINITION 3.12.2. [Inherently Ambiguous Context-Free Language] A
context-free language L is said to be inherently ambiguous iff every context-free
grammar which generates L is ambiguous.


We state without proof the following statements. 

The language L13 = {a^i b^j c^k | (i = j or j = k) and i, j, k ≥ 1} where the ‘or’ is an
‘inclusive or’, is a context-free language which is inherently ambiguous. On page 153
we have given the context-free grammar G13 which generates this language.

Also the language {a^n b^n c^m d^m | m, n ≥ 1} ∪ {a^n b^m c^m d^n | m, n ≥ 1} is a
context-free language which is inherently ambiguous.


3.13. Closure Properties of Context-Free Languages 


In this section we present some closure properties of the context-free languages. We 
have the following results. 


THEOREM 3.13.1. The class of context-free languages are closed under: (i) con- 
catenation, (ii) union, and (iii) Kleene star. 


PROOF. Let the language $L_1$ be generated by the context-free grammar
$G_1 = \langle V_{1T}, V_{1N}, P_1, S_1 \rangle$ and the language $L_2$ be generated by the
context-free grammar $G_2 = \langle V_{2T}, V_{2N}, P_2, S_2 \rangle$. Without loss of
generality, we may assume that the nonterminal symbols of the two grammars are
disjoint and that S is a new nonterminal symbol.

(i) $L_1 \cdot L_2$ is generated by the grammar

    $G = \langle V_{1T} \cup V_{2T},\ V_{1N} \cup V_{2N} \cup \{S\},\ P_1 \cup P_2 \cup \{S \to S_1 S_2\},\ S \rangle$.

(ii) $L_1 \cup L_2$ is generated by the grammar

    $G = \langle V_{1T} \cup V_{2T},\ V_{1N} \cup V_{2N} \cup \{S\},\ P_1 \cup P_2 \cup \{S \to S_1 \mid S_2\},\ S \rangle$.

(iii) $L_1^*$ is generated by the grammar
$G = \langle V_{1T},\ V_{1N} \cup \{S\},\ P_1 \cup \{S \to \varepsilon \mid S_1 S\},\ S \rangle$
(this grammar is an S-extended context-free grammar). Note that $L_1^+$ is generated
by the context-free grammar
$G = \langle V_{1T},\ V_{1N} \cup \{S\},\ P_1 \cup \{S \to S_1 \mid S_1 S\},\ S \rangle$. □
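
The three constructions used in this proof can be sketched in Python as follows. This is only an illustrative sketch: the representation of a grammar as a 4-tuple (terminals, nonterminals, productions, axiom), with productions stored in a dictionary, and the function names `concatenation`, `union`, and `star` are assumptions made here, not notation of the book.

```python
# A grammar is a tuple (terminals, nonterminals, productions, axiom).
# productions maps each nonterminal to a list of right-hand sides,
# and each right-hand side is a tuple of symbols (() stands for epsilon).

def _merge(p1, p2):
    """Union of two production dictionaries."""
    merged = {a: list(rhss) for a, rhss in p1.items()}
    for a, rhss in p2.items():
        merged.setdefault(a, []).extend(rhss)
    return merged

def _combine(g1, g2, new_rhss, s):
    (t1, n1, p1, _), (t2, n2, p2, _) = g1, g2
    # The nonterminals must be disjoint and s must be a new symbol.
    assert not (n1 & n2) and s not in (n1 | n2)
    p = _merge(p1, p2)
    p[s] = new_rhss
    return (t1 | t2, n1 | n2 | {s}, p, s)

def concatenation(g1, g2, s="S"):
    return _combine(g1, g2, [(g1[3], g2[3])], s)      # S -> S1 S2

def union(g1, g2, s="S"):
    return _combine(g1, g2, [(g1[3],), (g2[3],)], s)  # S -> S1 | S2

def star(g1, s="S"):
    t1, n1, p1, s1 = g1
    assert s not in n1
    p = {a: list(rhss) for a, rhss in p1.items()}
    p[s] = [(), (s1, s)]                              # S -> epsilon | S1 S
    return (t1, n1 | {s}, p, s)

g1 = ({"a"}, {"S1"}, {"S1": [("a",)]}, "S1")
g2 = ({"b"}, {"S2"}, {"S2": [("b",)]}, "S2")
print(concatenation(g1, g2)[2]["S"])
print(union(g1, g2)[2]["S"])
print(star(g1)[2]["S"])
```

In each case only the productions for the new axiom are added; the productions of the given grammars are left unchanged.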


THEOREM 3.13.2. The class of context-free languages is not closed under
intersection.


PROOF. Let us consider the language $L_1 = \{a^i b^i c^j \mid i \geq 1 \text{ and } j \geq 1\}$ generated
by the grammar with axiom $S_1$ and the following productions:

    $S_1 \to A\,C$
    $A \to a\,A\,b \mid a\,b$
    $C \to c\,C \mid c$

and the context-free language $L_2 = \{a^i b^j c^j \mid i \geq 1 \text{ and } j \geq 1\}$ generated by the
grammar with axiom $S_2$ and the following productions:

    $S_2 \to A\,B$
    $A \to a\,A \mid a$
    $B \to b\,B\,c \mid b\,c$

The language $L_1 \cap L_2 = \{a^i b^i c^i \mid i \geq 1\}$ is not a context-free language (see
Corollary 3.11.2 on page 152). □


THEOREM 3.13.3. The class of context-free languages is not closed under
complementation.


PROOF. Since the context-free languages are closed under union, if they were
closed under complementation, then, by De Morgan's law
$L_1 \cap L_2 = \overline{\overline{L_1} \cup \overline{L_2}}$, they would also be closed under
intersection, and this is not the case (see Theorem 3.13.2 on page 157). □


One can show that the complement of a context-free language is a context- 
sensitive language. 


THEOREM 3.13.4. If L is a context-free language and R a regular language then
$L \cap R$ is a context-free language.


PROOF. The reader may find the proof in [9, page 135]. The proof is based
on the fact that the pda which accepts L can be run in parallel with the finite
automaton which accepts R. The resulting parallel machine accepts $L \cap R$. □


Given a language L, $L^R$ denotes the language $\{w^R \mid w \in L\}$, where $w^R$ denotes
the word w with its characters in the reverse order (see Definition 2.12.3 on page 95).
We have the following theorem.


THEOREM 3.13.5. If L is a context-free language then the language $L^R$ is
context-free.


PROOF. Consider the Chomsky normal form G of a grammar which generates L.
The language $L^R$ is generated by the grammar $G^R$ where for every production
$A \to BC$ in G we consider, instead, the production $A \to CB$. □
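
The construction of $G^R$ amounts to reversing the right hand side of every production. A minimal sketch in Python, assuming (as an illustration only) that productions are stored in a dictionary from nonterminals to lists of tuples, and using the hypothetical function name `reverse_grammar`:

```python
def reverse_grammar(productions):
    """Return the productions of G^R: every right-hand side is reversed.

    For a grammar in Chomsky normal form, A -> B C becomes A -> C B,
    while productions of the form A -> a are unchanged.
    """
    return {a: [tuple(reversed(rhs)) for rhs in rhss]
            for a, rhss in productions.items()}

P = {"S": [("A", "B")], "A": [("a",)], "B": [("b",)]}   # generates {ab}
print(reverse_grammar(P))                                # generates {ba}
```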


THEOREM 3.13.6. The language $L^D = \{ww \mid w \in \{0,1\}^*\}$ is not context-free.


PROOF. If $L^D$ were a context-free language then by Theorem 3.13.4, also the
language $Z = L^D \cap 0^+1^+0^+1^+$, that is, $\{0^i 1^j 0^i 1^j \mid i \geq 1 \text{ and } j \geq 1\}$ would be a
context-free language, while it is not (see language Ls on page 152). □


With reference to Theorem 3.13.6, if we restrict the alphabet to the single
symbol 0, that is, if we consider the language $\{ww \mid w \in \{0\}^*\}$, then we get
the set of all words of even length in $\{0\}^*$, which is a regular language. Indeed,
we have the following result whose proof is left to the reader (see also Section 7.8
on page 232).


THEOREM 3.13.7. If we consider an alphabet $\Sigma$ with one symbol only, then a
language $L \subseteq \Sigma^*$ is a context-free language iff L is a regular language.


3.14. Basic Decidable Properties of Context-Free Languages 


In this section we present a few decidable properties of context-free languages. The 
reader who is not familiar with the concept of decidable and undecidable proper- 
ties (or problems) may refer to Chapter 6, where more results on decidability and 
undecidability of properties of context-free languages are listed.


THEOREM 3.14.1. Given any context-free grammar G, it is decidable whether
or not L(G) is empty.


PROOF. We can check whether or not L(G) is empty by checking whether or not 
the axiom of the grammar G produces a string of terminal symbols. This can be 
done by applying the From-Below Procedure (see Algorithm 3.5.1 on page 123). 
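
As an illustration, the emptiness test can be sketched in Python by computing, bottom-up, the set of nonterminals which produce a string of terminal symbols. This sketch follows the idea stated in the proof; it is not a transcription of Algorithm 3.5.1, and the grammar representation and the names `generating_nonterminals` and `is_empty` are assumptions made here.

```python
def generating_nonterminals(productions, terminals):
    """Nonterminals from which a string of terminal symbols can be derived.

    productions maps each nonterminal to a list of right-hand sides
    (tuples of symbols); the empty tuple stands for epsilon.
    """
    generating = set()
    changed = True
    while changed:
        changed = False
        for a, rhss in productions.items():
            if a in generating:
                continue
            # a is generating if some right-hand side consists only of
            # terminals and nonterminals already known to be generating
            if any(all(x in terminals or x in generating for x in rhs)
                   for rhs in rhss):
                generating.add(a)
                changed = True
    return generating

def is_empty(productions, terminals, axiom):
    return axiom not in generating_nonterminals(productions, terminals)

P1 = {"S": [("A", "B")], "A": [("a",)], "B": [("B", "b")]}  # B never terminates
P2 = {"S": [("a", "S"), ("a",)]}
print(is_empty(P1, {"a", "b"}, "S"), is_empty(P2, {"a"}, "S"))
```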


THEOREM 3.14.2. Given any context-free grammar G, it is decidable whether 
or not L(G) is finite. 


PROOF. We consider the grammar H such that: (i) $L(H) = L(G) - \{\varepsilon\}$, and
(ii) H is in Chomsky normal form with neither useless symbols nor ε-productions.
We construct a directed graph whose nodes are the nonterminals of H such that
there exists an edge from node A to node B iff there exists a production of H of the
form $A \to BC$ or $A \to CB$ for some nonterminal C. L(G) is finite iff in the directed
graph there are no loops. If there are loops, in fact, there exists a nonterminal A
such that $A \to^* \alpha A \beta$ with $|\alpha\beta| > 0$ [9, page 137].
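
The test described in this proof can be sketched in Python as a search for loops in the directed graph of the nonterminals. The sketch assumes that its input is already the grammar H (Chomsky normal form, no useless symbols, no ε-productions); the function name `is_finite_cnf` and the dictionary representation of the productions are illustrative assumptions.

```python
def is_finite_cnf(productions):
    """True iff the graph of the nonterminals of H has no loops.

    productions maps each nonterminal to a list of right-hand sides,
    each being either (a,) for a terminal a or (B, C) for nonterminals.
    """
    edges = {}
    for a, rhss in productions.items():
        edges.setdefault(a, set())
        for rhs in rhss:
            if len(rhs) == 2:               # A -> B C gives edges A->B, A->C
                for b in rhs:
                    edges[a].add(b)
                    edges.setdefault(b, set())

    WHITE, GREY, BLACK = 0, 1, 2            # unvisited, on the stack, done
    color = {a: WHITE for a in edges}

    def has_loop(a):
        color[a] = GREY
        for b in edges[a]:
            if color[b] == GREY or (color[b] == WHITE and has_loop(b)):
                return True
        color[a] = BLACK
        return False

    return not any(color[a] == WHITE and has_loop(a) for a in edges)

FINITE = {"S": [("A", "B")], "A": [("a",)], "B": [("b",)]}
INFINITE = {"S": [("A", "S"), ("a",)], "A": [("a",)]}
print(is_finite_cnf(FINITE), is_finite_cnf(INFINITE))
```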


As an immediate consequence of this theorem we have that given any context-free 
grammar G, it is decidable whether or not L(G) is infinite. 


3.15. Parsers for Context-Free Languages 


In this section we present two parsing algorithms for context-free languages: (i) the 
Cocke-Younger-Kasami Parser, (ii) the Earley Parser. 


3.15.1. The Cocke-Younger-Kasami Parser. 
The Cocke-Younger-Kasami algorithm is a parsing algorithm which works for any
context-free grammar G in Chomsky normal form without ε-productions. The
complexity of this algorithm is, as we will show, of order $O(n^3)$ in time and $O(n^2)$ in
space, where n is the length of the word to parse. We have to check whether or not a
given word $w = a_1 \ldots a_n$ is in L(G). The Cocke-Younger-Kasami algorithm is based
on the construction of an $n \times n$ matrix, called the recognition matrix. The element of
the matrix in row i and column j is the set of the nonterminal symbols from which
the substring $a_j a_{j+1} \ldots a_{j+i-1}$ can be generated (this substring has length i and its
first symbol is in position j).

We will see this algorithm in action in the following example. 

Let G be the grammar $\langle \{S, A, B, C, D, E, F\}, \{a, b\}, P, S \rangle$ whose set P of
productions is:

    $S \to CB \mid FA \mid FB$
    $A \to CS \mid FD \mid a$
    $B \to FS \mid CE \mid b$
    $C \to a$
    $D \to AA$
    $E \to BB$
    $F \to b$


[Figure: the recognition matrix for the string aababb.]


FIGURE 3.15.1. The recognition matrix for the string aababb of 
length n=6 and the grammar G given at the beginning of this Sec- 
tion 3.15.1. 


The recognition matrix for the string w = aababb is the one depicted in Fig- 
ure 3.15.1. We have the following correspondence between the symbols of w and 
their positions: 
    w:        a  a  b  a  b  b
    position: 1  2  3  4  5  6


In the given string w we have that the substring of length 3 starting at position 3 
is bab, and the substring of length 2 starting at position 4 is ab. 


The recognition matrix is upper triangular and only half of its entries are sig- 
nificant (see Figure 3.15.1). The various rows of the recognition matrix are filled as 
we now indicate. 

In the recognition matrix we place the nonterminal symbol V in row 1 and
column j, that is, in position (1, j), for j = 1, ..., n, iff the terminal symbol, say
$a_j$, in position j, that is, the substring of length 1 starting at position j, can be
generated from V, that is, $V \to a_j$ is a production in P. Now, since a is the
terminal symbol in position 1 of the given string, we place in row 1 and column 1
of the recognition matrix the two nonterminal symbols A (because $A \to a$) and C
(because $C \to a$).


In the recognition matrix we place the nonterminal symbol V in row 2 and
column j, that is, in position (2, j), for j = 1, ..., n-1, iff the substring of length 2
starting at position j can be generated from V, that is, $V \to^* a_j a_{j+1}$ (and this is
the case iff $V \to XY$ and $X \to a_j$ and $Y \to a_{j+1}$). For instance, since the substring
of length 2 starting at position 3 is ba, we place in row 2 and column 3 of the
recognition matrix the nonterminal symbol S because $S \to FA$ and $F \to b$ and
$A \to a$.

In general, in the recognition matrix we place the nonterminal symbol V in row
i and column j, that is, in position (i, j), for i = 2, ..., n, and j = 1, ..., n-(i-1),
iff
    $V \to XY$ and X is in position (1, j) and Y is in position (i-1, j+1), or
    $V \to XY$ and X is in position (2, j) and Y is in position (i-2, j+2), or
    ..., or
    $V \to XY$ and X is in position (i-1, j) and Y is in position (1, j+i-1).


In Figure 3.15.2 we have indicated as small black circles the pairs of positions of the
recognition matrix that we have to consider when filling the position (i, j) (depicted
as a white circle).


FIGURE 3.15.2. Construction of the element in row i and column j,
that is, in position (i, j), of the recognition matrix of the
Cocke-Younger-Kasami parser. That element is derived from the elements
in the following i-1 pairs of positions: ⟨(1, j), (i-1, j+1)⟩, ...,
⟨(i-1, j), (1, j+i-1)⟩. The length of the string to parse is n and the
position (n, 1) is marked in the figure.


It is easy to see that the given string w belongs to L(G) iff the axiom S occurs
in position (|w|, 1) (see the position (n, 1) in Figure 3.15.2).
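
The filling of the recognition matrix described above can be sketched in Python as follows. This is an illustrative sketch only: the encoding of the productions of the grammar G of this section as the two dictionaries `UNARY` and `BINARY`, and the function names `cyk_matrix` and `recognizes`, are assumptions made here.

```python
def cyk_matrix(word, unary, binary):
    """Return T, where T[i][j] is the set of nonterminals generating the
    substring of length i starting at position j (positions start at 1).

    unary maps a terminal a to the set of V such that V -> a;
    binary maps a pair (X, Y) to the set of V such that V -> X Y.
    """
    n = len(word)
    T = [[set() for _ in range(n + 2)] for _ in range(n + 1)]
    for j in range(1, n + 1):                      # row 1
        T[1][j] = set(unary.get(word[j - 1], ()))
    for i in range(2, n + 1):                      # rows 2, ..., n
        for j in range(1, n - i + 2):
            for k in range(1, i):                  # pairs (k, j), (i-k, j+k)
                for x in T[k][j]:
                    for y in T[i - k][j + k]:
                        T[i][j] |= binary.get((x, y), set())
    return T

def recognizes(word, unary, binary, axiom="S"):
    return len(word) > 0 and axiom in cyk_matrix(word, unary, binary)[len(word)][1]

UNARY = {"a": {"A", "C"}, "b": {"B", "F"}}
BINARY = {("C", "B"): {"S"}, ("F", "A"): {"S"}, ("F", "B"): {"S"},
          ("C", "S"): {"A"}, ("F", "D"): {"A"},
          ("F", "S"): {"B"}, ("C", "E"): {"B"},
          ("A", "A"): {"D"}, ("B", "B"): {"E"}}

T = cyk_matrix("aababb", UNARY, BINARY)
print(T[1][1], T[2][3], recognizes("aababb", UNARY, BINARY))
```

The three nested loops over i, j, and k mirror the row-by-row filling of the matrix and are the source of the cubic running time discussed next.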

The time complexity of the Cocke-Younger-Kasami algorithm is given by the 
time of constructing the recognition matrix which is computed as follows. 

Let n be the length of the string to parse. Let us assume that, given a context-free
grammar $G = \langle V_T, V_N, P, S \rangle$ in Chomsky normal form without ε-productions:


(i) given a set $S_a$ subset of $V_T$, it takes one unit of time to find the maximal subset
$S_A$ of $V_N$ such that for each $A \in S_A$, (i.1) there exists $a \in S_a$, and (i.2) $A \to a$ is a
production in P,

(ii) given any two subsets $S_B$ and $S_C$ of $V_N$, it takes one unit of time to find the
maximal subset $S_A$ of $V_N$ such that for each $A \in S_A$, (ii.1) there exist $B \in S_B$ and
$C \in S_C$, and (ii.2) $A \to BC$ is a production in P, and


(iii) any other operation takes 0 units of time.

We have that:


- row 1 of the recognition matrix is filled in n units of time,
- row 2 of the recognition matrix is filled in $(n-1) \times 1$ units of time,
..., and in general, for any i, with $2 \leq i \leq n$,


- row i of the recognition matrix is filled in $(n-i+1) \times (i-1)$ units of time
(indeed, in row i we have to fill $n-i+1$ entries and filling each entry requires
$i-1$ operations of the type (ii) above).


Thus, since $\sum_{i=1}^{n} i^2 = n(n+0.5)(n+1)/3$ (see [11, page 55]), we have that:

    $n + \sum_{i=2}^{n} (n-i+1)(i-1) = (n^3+5n)/6$.    (†)


This equality shows that the time complexity of the Cocke-Younger-Kasami
algorithm is of the order $O(n^3)$. (In order to validate the above equality (†), it is enough
to check it for four distinct values of n, because it is an equality between polynomials
of degree 3. For instance, we may choose the values 0, 1, 2, and 3.)
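
The suggested validation of the equality between the number of time units and $(n^3+5n)/6$ can be carried out mechanically. The following Python fragment is a small sketch; the function name `cyk_time_units` is an assumption made here.

```python
def cyk_time_units(n):
    """Units of time for filling the recognition matrix:
    n units for row 1, plus (n-i+1)*(i-1) units for each row i = 2, ..., n."""
    return n + sum((n - i + 1) * (i - 1) for i in range(2, n + 1))

# Four distinct values suffice for polynomials of degree 3.
for n in (0, 1, 2, 3):
    assert 6 * cyk_time_units(n) == n**3 + 5 * n

print([cyk_time_units(n) for n in range(7)])
```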


3.15.2. The Earley Parser. 


Let us consider an extended context-free grammar $G = \langle V_T, V_N, P, S \rangle$. We do not
make any restrictive assumption on this grammar: it may be ambiguous or not, it
may be with or without ε-productions, it may or may not include unit productions,
it may or may not be left recursive, and the axiom S may or may not occur on the
right hand side of the productions.

Let us begin by introducing the following notion.


DEFINITION 3.15.1. Given an extended context-free grammar $G = \langle V_T, V_N, P, S \rangle$
and a word $w \in V_T^*$ of length n, a [dotted production, position] pair is a construct of
the form: $[A \to \alpha\,.\,\beta,\ i]$ where: (1) $A \to \alpha\beta$ is a production in P, that is, $A \in V_N$
and $\alpha, \beta \in (V_N \cup V_T)^*$, and (2) i is an integer in {0, 1, ..., n}.


ALGORITHM 3.15.2. Earley Parser. 


Given an extended context-free grammar G, let us consider a word $w = a_1 \ldots a_n$,
and let us check whether or not w belongs to L(G). If n = 0, w is the empty string,
denoted ε.

We construct a sequence $\langle I_0, I_1, \ldots, I_n \rangle$ of n+1 sets of [dotted production, position]
pairs.


Construct the set $I_0$ as follows:


R1. (Initialization Rule for $I_0$)
    For each production $S \to \alpha$ in the set of productions P, add $[S \to .\,\alpha,\ 0]$ to $I_0$.
    In particular, if $\alpha = \varepsilon$ we add $[S \to .\,,\ 0]$.

R2. Apply the closure rules C1 and C2 (see below) to the set $I_0$.

For each $0 < j \leq n$, construct the set $I_j$ from the set $I_{j-1}$ as follows:

R3. (Initialization Rule for $I_j$ with j > 0)
    For each $[A \to \alpha\,.\,a_j\,\beta,\ i] \in I_{j-1}$, add $[A \to \alpha\,a_j\,.\,\beta,\ i]$ to $I_j$.

R2. Apply the closure rules C1 and C2 (see below) to the set $I_j$.

The closure rules C1 and C2 for the set $I_j$, for j = 0, ..., n, are as follows:

C1. (Forward Closure Rule)
    if $[A \to \alpha\,.\,B\,\beta,\ i] \in I_j$ and $B \to \gamma$ is a production
    then add $[B \to .\,\gamma,\ j]$ to $I_j$.

C2. (Backward Closure Rule)
    if $[A \to \alpha\,.\,,\ i] \in I_j$
    then for every $[B \to \beta\,.\,A\,\gamma,\ k]$ in $I_i$ with $i \leq j$, add $[B \to \beta\,A\,.\,\gamma,\ k]$ to $I_j$.

As established by the following Corollary 3.15.4, we have that $w \in L(G)$ iff
$[S \to \alpha\,.\,,\ 0] \in I_n$, for some production $S \to \alpha$ of G.


We have the following theorem and corollary which establish the correctness of the 
Earley parser. We state them without proof. 


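THEOREM 3.15.3. For every $i \geq 0$, $j \geq 0$, for every $a_1 \ldots a_j \in V_T^*$, and for every
$\alpha, \beta \in V^*$, $[A \to \alpha\,.\,\beta,\ i] \in I_j$ iff there is a leftmost derivation

    $S \to^* a_1 \ldots a_i\, A\, \gamma \to a_1 \ldots a_i\, \alpha \beta \gamma \to^* a_1 \ldots a_j\, \beta \gamma$.

The reader will note that $i \leq j$ because we have considered a leftmost derivation.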


As a consequence of Theorem 3.15.3 we have the following corollary. 


COROLLARY 3.15.4. Given an extended context-free grammar G, the word
$w = a_1 \ldots a_n \in L(G)$ iff $[S \to \alpha\,.\,,\ 0] \in I_n$ for some production $S \to \alpha$ of G.


Thus, in particular, for any extended context-free grammar G, we have that:
$\varepsilon \in L(G)$ iff $[S \to \alpha\,.\,,\ 0] \in I_0$ for some production $S \to \alpha$ of G.
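
The rules R1, R2, R3, C1, and C2 can be sketched as the following Python recognizer. It is an illustrative sketch, not an optimized implementation: a [dotted production, position] pair is encoded here as a tuple (lhs, rhs, dot, i), the closure is computed by naive iteration until no new pair is added, the symbol `*` is used in place of ×, and the names `earley_sets` and `G_EXPR` are assumptions made here.

```python
def earley_sets(productions, axiom, word):
    """Return (accepted, I) where I[j] is the set I_j of the Earley parser.

    productions is a list of pairs (lhs, rhs), rhs being a tuple of symbols.
    The tuple (lhs, rhs, dot, i) stands for [lhs -> rhs[:dot] . rhs[dot:], i].
    """
    nonterminals = {lhs for lhs, _ in productions}
    n = len(word)
    I = [set() for _ in range(n + 1)]
    I[0] = {(lhs, rhs, 0, 0) for lhs, rhs in productions if lhs == axiom}  # R1
    for j in range(n + 1):
        if j > 0:                                                          # R3
            I[j] = {(a, rhs, d + 1, i) for (a, rhs, d, i) in I[j - 1]
                    if d < len(rhs) and rhs[d] == word[j - 1]}
        changed = True
        while changed:                                                     # R2
            changed = False
            for (a, rhs, d, i) in list(I[j]):
                if d < len(rhs) and rhs[d] in nonterminals:                # C1
                    new = {(lhs, r, 0, j) for lhs, r in productions
                           if lhs == rhs[d]}
                elif d == len(rhs):                                        # C2
                    new = {(b, r, e + 1, k) for (b, r, e, k) in I[i]
                           if e < len(r) and r[e] == a}
                else:
                    new = set()
                if not new <= I[j]:
                    I[j] |= new
                    changed = True
    accepted = any(a == axiom and d == len(rhs) and i == 0
                   for (a, rhs, d, i) in I[n])
    return accepted, I

G_EXPR = [("E", ("E", "+", "T")), ("E", ("T",)),
          ("T", ("T", "*", "F")), ("T", ("F",)),
          ("F", ("(", "E", ")")), ("F", ("a",))]

accepted, I = earley_sets(G_EXPR, "E", "a+a*a")
print(accepted, len(I[0]), len(I[1]))
```

Since C2 looks up the set $I_i$ with $i \leq j$, and the closure is iterated until it stabilizes, the case i = j (which arises with ε-productions) needs no special treatment in this sketch.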


We will not explain here the ideas which motivate the Forward Closure Rule C1
and the Backward Closure Rule C2 of the Earley Parser. (The expert reader will
understand those rules by comparing them with the Closure Rule for constructing
LR(1) parsers (see [15, Section 5.4]).) However, in order to help the reader’s
intuition, we now give the following informal explanations of the occurrences of the
[dotted production, position] pairs in the sets $I_j$, for j = 0, ..., n.

Let us assume that the input string is $a_1 \ldots a_n$ for some $n \geq 0$. Let j be an
integer in {0, 1, ..., n}.


(1) $[A \to \alpha\,.\,B\,\gamma,\ i] \in I_j$ means that:
    if the input string has been parsed up to the symbol $a_i$, then we parse the
    input string up to the symbol $a_j$ (the string ‘$\alpha$ .’ starts at position i and
    ends at position j).

(2) $[B \to .\,\gamma,\ j] \in I_j$ means that:
    the input string has been parsed up to $a_j$ (‘.’ is in position j).

(3) $[S \to \alpha\,.\,,\ 0] \in I_n$ means that:
    the input has been parsed from position 0 up to position n (the string ‘$\alpha$ .’
    starts at position 0 and ends at position n).


Now let us see the Earley parser in action for the input word:

    a + a × a


(thus, in our case the length n of the word is 5) and the grammar with axiom E
and the following productions:


    $E \to E + T$        $E \to T$
    $T \to T \times F$        $T \to F$
    $F \to (E)$          $F \to a$

We construct the following sets $I_0, I_1, \ldots, I_5$ of [dotted production, position] pairs.
For k = 1, ..., 5, the set $I_k$ is in correspondence with the k-th character of the given
input word. We can arrange the sets $I_0, I_1, \ldots, I_5$ in a sequence whose elements
correspond to the symbols of the input word, as indicated by the following table:


    I_0:
    I_1:  a
    I_2:  +
    I_3:  a
    I_4:  ×
    I_5:  a

For k = 0, ..., 5, in the set $I_k$ we will list two subsets of [dotted production, position]
pairs separated by a horizontal line. Above that horizontal line we will list the
[dotted production, position] pairs which are generated by the rules R1 and R3,
and below that line we will list the [dotted production, position] pairs which are
generated by the closure rules C1 and C2.

For k = 0, ..., 5, the [dotted production, position] pairs of the set $I_k$ are identified
by the label (k m), for $m \geq 1$. When writing [dotted production, position] pairs, for
reasons of simplicity, we will feel free to drop the square brackets.


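    I_0:
    (01): $E \to .\,E + T$, 0
    (02): $E \to .\,T$, 0
    ---------------------------------------------------
    (03): $T \to .\,T \times F$, 0     : from (02) by C1
    (04): $T \to .\,F$, 0              : from (02) by C1
    (05): $F \to .\,(E)$, 0            : from (04) by C1
    (06): $F \to .\,a$, 0              : from (04) by C1

    I_1:
    (11): $F \to a\,.$, 0              : from (06) by R3
    ---------------------------------------------------
    (12): $T \to F\,.$, 0              : from (04) and (11) by C2
    (13): $T \to T\,.\times F$, 0      : from (03) and (12) by C2
    (14): $E \to T\,.$, 0              : from (02) and (12) by C2
    (15): $E \to E\,.+ T$, 0           : from (01) and (14) by C2

    [Listings of the sets I_2, ..., I_5.]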

The word a + a × a belongs to the language generated by the given grammar because
$[E \to E + T\,.\,,\ 0]$ belongs to $I_5$ (see line (53) in the set $I_5$). Then, in order to get
the parse tree of a + a × a, we can proceed as specified by the following procedure
in three steps.


ALGORITHM 3.15.5.
Procedure for generating the parse tree of a given word w of length n parsed by the
Earley parser.

Step (1). Tracing back. First we construct a tree T1 whose nodes are labeled by
[dotted production, position] pairs as we now indicate.


(i) The root of the tree is labeled by the [dotted production, position] pair of the
form $[S \to \alpha\,.\,,\ 0]$ belonging to the set $I_n$, that is, the pair which indicates that the
given word w belongs to the language generated by the given grammar.


(ii) A node is a leaf iff in the right hand side of its dotted production there is no
nonterminal symbol to the left of ‘.’.


(ii.1) If a node p is not a leaf and is generated from node q by applying Rule R3 or
Rule C1, then we make q the only son-node of p.

(ii.2) If a node p is not a leaf and is generated from the nodes $q_1$ and $q_2$ by applying
Rule C2, then we create two son-nodes of the node p: we make $q_1$ the left
son-node of p and $q_2$ the right son-node of p iff $q_1 \in I_i$ and $q_2 \in I_j$ with
$0 \leq i \leq j \leq n$.


Step (2). Pruning. Then, we prune the tree T1 produced at the end of Step (1) by
erasing every node which is labeled by a [dotted production, position] pair whose
dot does not occur at the rightmost position. If the node to be erased is not a leaf
we apply the Rule E1 depicted in Figure 3.15.3 on page 167. We also erase in each
[dotted production, position] pair its label, its dot, and its position. Let T2 be the
tree obtained at the end of this Step (2).


Step (3). Redrawing. Finally, we apply in a bottom-up fashion, the Rule E2 depicted
in Figure 3.15.3 to the tree T2 obtained at the end of Step (2), thereby getting the
parse tree T3 of the given word w.


Figure 3.15.4 on page 168 shows the tree T1 obtained at the end of Step (1) for the
word a + a × a. Figure 3.15.5 Part (α) on page 168 shows the tree T2 obtained at the
end of Step (2) from the tree T1 depicted in Figure 3.15.4. Figure 3.15.5 Part (β)
on page 168 shows the tree T3 obtained at the end of Step (3) from the tree T2
depicted in Figure 3.15.5 (α).

We have the following time complexity results concerning the Earley parser. 
First we need the following definition. 


DEFINITION 3.15.6. [Strongly Unambiguous Context-Free Grammar] A
context-free grammar $G = \langle V_T, V_N, P, S \rangle$ is said to be strongly unambiguous if for
every nonterminal symbol $A \in V_N$ and for every string $w \in V_T^*$ there exists at most
one leftmost derivation starting from A and producing w.


Given any context-free grammar, the Earley parser takes $O(n^3)$ steps to parse
any string of length n. If the given context-free grammar is strongly unambiguous
then the Earley parser takes $O(n^2)$ steps to parse any string of length n. Note that
from any unambiguous context-free grammar (recall Definition 3.12.1 on page 155)
we can obtain an equivalent strongly unambiguous context-free grammar in linear
time.

Every deterministic context-free language can be generated by a context-free 
grammar for which the Earley parser takes O(n) steps to parse any string of length n.


[Figure: Rule E1 (erasing the node q and applying transitivity) and Rule E2.]


FIGURE 3.15.3. Above: Rule E1 for erasing a node q which is labeled
by a [dotted production, position] pair whose dot does not occur at
the rightmost position. Below: Rule E2 for constructing the parse
tree starting from the tree produced at the end of Step (2). A and B
are nonterminal symbols, a and c are terminal symbols, and $\alpha$, $\beta$, $\gamma$,
and $\delta$ are strings in $(V_T \cup V_N)^*$. By E2(t) we denote the tree obtained
from the tree t by applying Rule E2.


3.16. Parsing Classes of Deterministic Context-Free Languages 


In the previous section we have seen that parsing context-free languages can be done,
in general, in cubic time. Indeed, there is an algorithm for parsing any context-free
language which in the worst case takes no more than $O(n^3)$ time, for an input of
length n. Actually, L. Valiant [23] proved that the upper bound of the time
complexity for parsing context-free languages is equal to that of matrix multiplication.
Thus, for the evaluation of the asymptotic complexity, the exponent 3 of $n^3$ can be
lowered to $\log_2 7$ (recall the Strassen algorithm for multiplying matrices [20]), and
even to smaller values.


However, for the construction of efficient compilers we should be able to parse 
strings of characters in linear time, rather than cubic time. Thus, the strings to be 
parsed should be generated by particular context-free grammars which allow parsing 
in O(n) time complexity.


[Figure: the tree T1, whose nodes are labeled by [dotted production, position] pairs, with root (53): $E \to E + T\,.$, 0.]


FIGURE 3.15.4. The tree T1 for the word a +a x a (see page 166). 
We have underlined the productions with the symbol ‘.’ at the right- 
most position. The corresponding nodes occur in the tree T2 of Fig- 
ure 3.15.5 (a) on page 168. 


[Figure: (α) Tree T2, (β) Tree T3.]


FIGURE 3.15.5. The trees T2 and T3 for the word a + a × a. Tree T3
is the parse tree of a + a × a (see page 166).


There are, indeed, particular classes of context-free languages which allow parsing
in linear time. Some of these classes are subclasses of the deterministic context- 
free languages (see Section 3.3). Thus, given a context-free language L, it may 
be important to know whether or not L is a deterministic context-free language. 
Unfortunately, in general, we have the following negative result (see also Section 6.1.1 
on page 201). 


FACT 3.16.1. [Undecidability of Testing Determinism of Languages
Generated by Context-Free Grammars] It is undecidable, given a context-free
grammar G, whether or not it generates a deterministic context-free language.


Given a context-free language L, one can show that L is a nondeterministic 
context-free language by showing that either the complement of L is a nondeter- 
ministic context-free language or that the complement of L is not a context-free 
language. 

The validity of this test follows from the fact that deterministic context-free 
languages are closed under complementation (see Section 3.17 below), while nonde- 
terministic context-free languages are not closed under complementation (see Sec- 
tion 3.13) [1]. 


In the book [15] we have presented some parsing techniques for various subclasses
of the deterministic context-free languages and, in particular, for: (i) the LL(k)
languages, (ii) the LR(k) languages, and (iii) the operator-precedence languages
[1, 9]. These techniques are used in the parsing algorithms of the compilers of many
popular programming languages such as C++ and Java.


In the following two sections we will present some basic closure and decidability 
results about deterministic context-free languages. These results may allow us to 
check whether or not a given context-free language is deterministic. Thus, if by those 
results one can show that a language is not a deterministic context-free language, 
then one cannot apply the faster parsing techniques for LL(k) languages or LR(k) 
languages or operator-precedence languages that we have mentioned above. 


Recall that a deterministic context-free language can be given by providing either
(i) the instructions of a deterministic pda which accepts it, or (ii) a context-free
grammar which is an LR(k) grammar, for some $k \geq 1$ [15, Section 5.1].

Actually, for any deterministic context-free language one can find an LR(1)
grammar which generates it [9].


3.17. Closure Properties of Deterministic Context-Free Languages 
We have the following results (see also Section 7.5 on page 224). 


THEOREM 3.17.1. [Closure of Deterministic Context-Free Languages Un- 
der Complementation] Let L be a deterministic context-free language. Then 
$\Sigma^* - L$ is a deterministic context-free language.


PROOF. It is not immediate and can be found in [9, page 238].


THEOREM 3.17.2. The class of the deterministic context-free languages is closed
under intersection with a regular set.


PROOF. Similar to the one of Theorem 3.13.4 on page 158. □


THEOREM 3.17.3. The class of the deterministic context-free languages is not 
closed under concatenation, union, intersection, Kleene star, reversal. 


PROOF. See [9, pages 247 and 281] and also [8, page 346].


3.18. Decidable Properties of Deterministic Context-Free Languages 


In this section we will present some decidable properties of deterministic
context-free languages. A more comprehensive list of decidability and undecidability results
concerning deterministic context-free languages can be found in Sections 6.1-6.4 (see
also [9, pages 246-247]).

We assume that every deterministic context-free language we consider in this
section is a subset of $V_T^*$ for some terminal alphabet $V_T$ with at least two symbols.


(D1) It is decidable given a deterministic context-free language L and a regular 
language R, to test whether or not L = R. 


(D6) It is decidable given a deterministic context-free language L, to test whether 
or not L is prefix-free (see Definition 3.3.9 on page 120) [8, page 355]. 


(D7) It is decidable given any two deterministic context-free languages L1 and L2, 
to test whether or not L1 = L2 [19]. 


Properties (D2)—(D5) that do not appear in this listing, are some more decid- 
able properties of the deterministic context-free languages which we will present in 
Section 6.2 starting on page 204. 

With reference to Property (D7), note that, on the contrary, it is undecidable to 
test whether or not L(G1) = L(G2) for any given two context-free grammars G1 
and G2. 


We have also the following undecidability results. 


(U1) It is undecidable given any two deterministic context-free languages L1 and L2,
to test whether or not $L1 \cap L2 = \emptyset$, and


(U2) It is undecidable given any two deterministic context-free languages L1 and L2,
to test whether or not $L1 \subseteq L2$.


Note that the problem (U2) of testing whether or not $L1 \subseteq L2$ can be reduced
to the problem of testing whether or not $L1 \cap (V_T^* - L2) = \emptyset$. Since deterministic
context-free languages are closed under complementation, we get that $V_T^* - L2$ is
a deterministic context-free language and, thus, undecidability of (U1) follows from
undecidability of (U2).


CHAPTER 4 


Linear Bounded Automata and Context-Sensitive Grammars 


In this chapter we first show that the notions of context-sensitive grammars and 
type 1 grammars are equivalent. Then we show that every context-sensitive language 
is a recursive set. Finally, we introduce the class of the linear bounded automata 
and we show that these automata are characterized by the fact that they accept the 
context-sensitive languages. 

We assume that the reader is familiar with the basic notions and properties of 
Turing Machines which we will present in Chapter 5 below. More information on 
Turing Machines can be found in textbooks such as [9]. 


In this chapter, unless otherwise specified, we use the notions of type 1 produc- 
tions, grammars, and languages which we have introduced in Definition 1.5.7 on 
page 21. We recall them here for the reader’s convenience. 


DEFINITION 4.0.1. [Type 1 Production, Grammar, and Language.
Version with Epsilon Productions] Given a grammar $G = \langle V_T, V_N, P, S \rangle$, we say
that a production in P is of type 1 iff
(i.1) either it is of the form $\alpha \to \beta$, where $\alpha \in (V_T \cup V_N)^+$, $\beta \in (V_T \cup V_N)^+$, and
$|\alpha| \leq |\beta|$, or it is $S \to \varepsilon$, and
(i.2) the axiom S does not occur on the right hand side of any production if the
production $S \to \varepsilon$ is in P.

A grammar is said to be of type 1 if all its productions are of type 1. A language is
said to be of type 1 if it is generated by a type 1 grammar.


We also use the following notions of context-sensitive productions, grammars, 
and languages which we have introduced in Definition 1.5.7. 


DEFINITION 4.0.2. [Context-Sensitive Production, Grammar, and
Language. Version with Epsilon Productions] Given a grammar $G = \langle V_T, V_N, P, S \rangle$,
a production in P is said to be context-sensitive iff
(i) either it is of the form $uAv \to uwv$, where $u, v \in V^*$, $A \in V_N$, and $w \in (V_T \cup V_N)^+$,
or it is $S \to \varepsilon$, and
(ii) the axiom S does not occur on the right hand side of any production if the
production $S \to \varepsilon$ is in P.

A grammar is said to be context-sensitive if all its productions are context-sensitive. 
A language is said to be context-sensitive if it is generated by a context-sensitive 
grammar. 




Let us start by proving the following Theorem 4.0.3. This theorem generalizes 
Theorem 1.3.4 which we stated on page 13. Indeed, in this Theorem 4.0.3 the equiv- 
alence between type 1 grammars and context-sensitive grammars is established with 
reference to the above Definitions 4.0.1 and 4.0.2, rather than to the Definitions 1.3.1 
and 1.3.3 (see pages 13 and 13, respectively). 


THEOREM 4.0.3. [Equivalence Between Type 1 Grammars and Context- 
Sensitive Grammars. Version with Epsilon Productions] With reference to 
Definitions 4.0.1 and 4.0.2 we have that: (i) for every type 1 grammar there exists an 
equivalent context-sensitive grammar, and (ii) for every context-sensitive grammar 
there exists an equivalent type 1 grammar. 


PROOF. (i) For every given type 1 grammar G we first construct the equivalent
grammar, call it $G_s$, in separated form. Let $G_s$ be $\langle V_T, V_N, P, S \rangle$. Then, from $G_s$ we
construct the grammar $G' = \langle V_T, V_N', P', S \rangle$ which is a context-sensitive grammar
as follows. The set P' of productions is constructed from the set P by considering
the following productions:

(i.1) $S \to \varepsilon$, if it occurs in P,

(i.2) every production of P of the form $A \to a$, and

(i.3) for every not context-sensitive production of P of the form:

    (α)  $A_1 \ldots A_m \to B_1 \ldots B_n$

with $1 \leq m \leq n$, such that $A_1, \ldots, A_m, B_1, \ldots, B_n \in V_N$, the following
context-sensitive productions, where the symbols $C_i$'s are new nonterminal symbols not
in $V_N$:

         $A_1 A_2 \ldots A_m \to C_1 A_2 \ldots A_m$
         $C_1 A_2 \ldots A_m \to C_1 C_2 A_3 \ldots A_m$
         ...
    (β)  $C_1 C_2 \ldots C_{m-1} A_m \to C_1 C_2 \ldots C_m B_{m+1} \ldots B_n$
         $C_1 C_2 \ldots C_m B_{m+1} \ldots B_n \to B_1 C_2 \ldots C_m B_{m+1} \ldots B_n$
         ...
         $B_1 B_2 \ldots B_{m-1} C_m B_{m+1} \ldots B_n \to B_1 B_2 \ldots B_m B_{m+1} \ldots B_n$


We leave it to the reader to show that the replacement of every production of the
form (α) by the productions of the form (β) does not modify the language generated
by the grammar. The set $V_N'$ consists of the nonterminal symbols of $V_N$ and all the
symbols $C_i$'s which occur in the productions of the form (β).

(ii) The proof of this point is obvious because every context-sensitive production is
a production of type 1. □
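
The replacement of a production $A_1 \ldots A_m \to B_1 \ldots B_n$ by the 2m context-sensitive productions used in Point (i) of the proof can be sketched in Python as follows. The function name `split_production` and the representation of the two sides of a production as lists of symbol names are assumptions made here for illustration.

```python
def split_production(lhs, rhs, fresh):
    """Replace A_1...A_m -> B_1...B_n (1 <= m <= n, nonterminals only)
    by 2m productions, each rewriting one symbol in a fixed context.

    lhs = [A_1, ..., A_m], rhs = [B_1, ..., B_n],
    fresh = [C_1, ..., C_m] (new nonterminal symbols).
    """
    m, n = len(lhs), len(rhs)
    assert 1 <= m <= n and len(fresh) == m
    tail = rhs[m:]                          # B_{m+1} ... B_n
    result = []
    for k in range(1, m + 1):               # A_k is rewritten into C_k
        left = fresh[:k - 1] + lhs[k - 1:]
        right = fresh[:k] + lhs[k:]
        if k == m:                          # the last step also adds the tail
            right = right + tail
        result.append((left, right))
    for k in range(1, m + 1):               # C_k is rewritten into B_k
        left = rhs[:k - 1] + fresh[k - 1:] + tail
        right = rhs[:k] + fresh[k:] + tail
        result.append((left, right))
    return result

for left, right in split_production(["A1", "A2"], ["B1", "B2", "B3"], ["C1", "C2"]):
    print(" ".join(left), "->", " ".join(right))
```

In every production returned, the left and right hand sides differ in exactly one position of the left hand side, so each production has the form $uAv \to uwv$ with w nonempty.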


Having proved this theorem, when speaking about languages, we will feel free to 
use the qualification ‘type 1’, instead of ‘context-sensitive’, and vice versa.

Let us consider an alphabet $V_T$ and the set of all words in $V_T^*$.


DEFINITION 4.0.4. [Recursive (or Decidable) Language] We say that a
language $L \subseteq V_T^*$ is recursive (or decidable) iff there exists a Turing Machine M which
accepts every word w belonging to L and rejects every word w which does not belong
to L (see Definition 5.0.6 on page 186 and Definition 6.0.5 on page 195).


THEOREM 4.0.5. [Recursiveness (or Decidability) of Context Sensitive
Languages] Every context-sensitive grammar $G = \langle V_T, V_N, P, S \rangle$ generates a
language L(G) which is a recursive subset of $V_T^*$.


PROOF. Let us consider a context-sensitive grammar G = ⟨V_T, V_N, P, S⟩ and a
word w ∈ (V_T ∪ V_N)*. We have to check whether or not w ∈ L(G). If w = ε it is
enough to check whether or not the production S → ε is in P. Now let us assume
that w ≠ ε. Let the length |w| of w be n (≥ 1) and let d be the cardinality of V_T ∪ V_N.

Since context-sensitive grammars are type 1 grammars, during every derivation
we get a sequence of sentential forms whose length cannot decrease. Now, since for
any k ≥ 0, there are d^k distinct words of length k in (V_T ∪ V_N)*, if during a derivation
a sentential form has length k, then at most d^k derivation steps can be performed
before deriving either an already derived sentential form or a new sentential form of
length at least k+1.

Thus, for any given word w ∈ V_T^+, by exploring all possible derivations starting
from the axiom S, for at most d + d^2 + ... + d^{|w|} derivation steps, we will encounter
w iff w ∈ L(G).
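
To get a feeling for the size of the bound d + d^2 + ... + d^{|w|}, one may evaluate it for small values. The following lines are a minimal sketch; the function name and the sample values d = 6 and n = 4 are hypothetical and are chosen only for illustration.

```python
def max_derivation_steps(d, n):
    """Upper bound d + d^2 + ... + d^n on the derivation steps to explore."""
    return sum(d ** k for k in range(1, n + 1))


# Hypothetical sizes: 3 terminals and 3 nonterminals (d = 6), word of length 4.
print(max_derivation_steps(6, 4))   # 6 + 36 + 216 + 1296 = 1554
```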


The generate-and-test algorithm we have described in the proof of the above 
Theorem 4.0.5, can be considerably improved as indicated by the following Algo- 
rithm 4.0.6. 


ALGORITHM 4.0.6. Testing whether or not a given word w belongs to the lan-
guage generated by the type 1 grammar G = ⟨V_T, V_N, P, S⟩ (see Definition 4.0.1 on
page 171).


We are given a type 1 grammar G = ⟨V_T, V_N, P, S⟩. Without loss of generality,
we may assume that the axiom S does not occur on the right hand side of any
production. We are also given a word w ∈ V_T*.

We have that ε ∈ L(G) iff the production S → ε is in P. If w ≠ ε and |w| = n,
we construct a sequence ⟨T_0, T_1, ..., T_s⟩ of subsets of (V_T ∪ V_N)^+ recursively defined
as follows:


    T_0 = {S}
    T_{m+1} = T_m ∪ {α | for some σ ∈ T_m, σ →_G α and |α| ≤ n}

until we construct a set T_s such that T_s = T_{s+1}.
We have that w ∈ V_T^+ is in L(G) iff w ∈ T_s.

We leave it to the reader to prove the correctness of this algorithm. That proof is a
consequence of the following facts, where n denotes the length of the word w and d
denotes the cardinality of V_T ∪ V_N:

(i) for any m ≥ 0, the set T_m is a finite set of strings α in (V_T ∪ V_N)^+ such that
S →*_G α and |α| ≤ n,

(ii) the number of strings in (V_T ∪ V_N)^+ whose length is not greater than n, is
d + d^2 + ... + d^n,

(iii) for all m ≥ 0, if T_m ≠ T_{m+1} then T_m ⊂ T_{m+1}, and

(iv) if for some s ≥ 0, T_s = T_{s+1} then for all p with p ≥ s, T_s = T_p.


From these facts it follows that the sequence ⟨T_0, T_1, ..., T_s⟩ of sets of words is
finite and the algorithm terminates.


Now we give an example of use of Algorithm 4.0.6. 

EXAMPLE 4.0.7. Let us consider the grammar with axiom S and the following
productions:

    1. S → aSBC
    2. S → aBC
    3. CB → BC
    4. aB → ab
    5. bB → bb
    6. bC → bc
    7. cC → cc

The language generated by that grammar is L(G) = {a^n b^n c^n | n ≥ 1}. Let us
consider the word w = abac and let us check whether or not abac ∈ L(G). We have
that |w| = 4. By applying Algorithm 4.0.6 we get the following sequence of sets:

    T_0 = {S}
    T_1 = {S, aSBC, aBC}
    T_2 = {S, aSBC, aBC, abC}
    T_3 = {S, aSBC, aBC, abC, abc}
    T_4 = T_3

Note that when constructing T_2 from T_1, we have not included the sentential form
aaBCBC which can be derived from aSBC by applying the production S → aBC,
because |aaBCBC| = 6 > 4 = |w|. We have that abac ∉ L(G) because abac ∉ T_3.
Indeed, abac ≠ a^n b^n c^n for all n ≥ 1. □
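
The construction of the sets T_0, T_1, ... of Algorithm 4.0.6 can be sketched in Python as follows. This is a minimal sketch under some assumptions of ours: every grammar symbol is a single character, a production is a pair (left hand side, right hand side) of strings, and the names one_step, closure_sets, and generates are hypothetical.

```python
def one_step(sigma, productions):
    """All strings obtained from sigma by applying one production once."""
    result = set()
    for lhs, rhs in productions:
        pos = sigma.find(lhs)
        while pos != -1:
            result.add(sigma[:pos] + rhs + sigma[pos + len(lhs):])
            pos = sigma.find(lhs, pos + 1)
    return result


def closure_sets(n, productions, axiom="S"):
    """The list [T_0, T_1, ..., T_s], where s is the least index with T_s = T_{s+1}."""
    sets = [{axiom}]
    while True:
        current = sets[-1]
        new = {a for s in current for a in one_step(s, productions) if len(a) <= n}
        if new <= current:          # T_{s+1} = T_s: nothing new can be derived
            return sets
        sets.append(current | new)


def generates(w, productions, axiom="S"):
    """Membership test: the empty word is handled separately, as in the text."""
    if w == "":
        return (axiom, "") in productions
    return w in closure_sets(len(w), productions, axiom)[-1]


# The grammar of the example above.
EXAMPLE = [("S", "aSBC"), ("S", "aBC"), ("CB", "BC"), ("aB", "ab"),
           ("bB", "bb"), ("bC", "bc"), ("cC", "cc")]

for t in closure_sets(4, EXAMPLE):
    print(sorted(t))
print(generates("abac", EXAMPLE), generates("aabbcc", EXAMPLE))
```

For n = 4 the function closure_sets returns the four sets T_0, ..., T_3 listed in the example, and the word abac is not in the last one.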


Now we show the correspondence between linear bounded automata and context- 
sensitive languages. 




DEFINITION 4.0.8. [Linear Bounded Automaton] A linear bounded automa-
ton (or LBA, for short) is a nondeterministic Turing Machine M (see Definition 5.0.1
on page 184) such that:

(i) the input alphabet is Σ ∪ {¢, $}, where ¢ and $ are two distinguished symbols not
in Σ, which are used as the left endmarker and the right endmarker of any input
word w ∈ Σ*, so that the initial tape configuration is ¢w$ (see Definition 5.0.3 on
page 185) and the cell scanned by the tape head is the leftmost one with the symbol
¢, and

(ii) M moves neither to the left of the cell with ¢, nor to the right of the cell with $,
and if M scans the cell with ¢ (or $) then M prints ¢ (or $, respectively) and moves
to the right (or to the left, respectively).

More formally, a linear bounded automaton is a tuple of the form ⟨Q, Σ, Γ, q_0, ¢,
$, F, δ⟩, where: Q is a finite set of states, Σ is the input alphabet, Γ is the tape
alphabet, q_0 in Q is the initial state, ¢ is the left endmarker, $ is the right endmarker,
F ⊆ Q is the set of final states, and δ is a partial function from Q × Γ to Powerset(Q ×
Γ × {L, R}), called the transition function.


With respect to the definition of a Turing Machine we note that:

(i) for a linear bounded automaton there is no need of the blank symbol B, and

(ii) the codomain of the transition function δ is Powerset(Q × Γ × {L, R}), rather than
Q × (Γ - {B}) × {L, R}, because we have assumed that unless otherwise specified,
a linear bounded automaton is nondeterministic.


The notion of a language accepted by an LBA is the one used for a Turing 
Machine, that is, the notion of acceptance is by final state (see Definition 5.0.6 on 
page 186). 
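
The behaviour of a linear bounded automaton on a given input word can be simulated by a search over its configurations, as in the following Python sketch. The sketch rests on assumptions of ours: the transition function is a dictionary from pairs (state, symbol) to sets of triples (state, symbol, move), every symbol is a single character, and reaching a final state is taken as acceptance (which agrees with acceptance by final state when the automaton stops as soon as it enters a final state). The toy automaton EVEN_A, which accepts the words over {a} with an even number of a's, is hypothetical. The search terminates because the tape has the fixed length |w| + 2 and thus, there are finitely many configurations.

```python
from collections import deque


def lba_accepts(w, delta, q0, finals, left="¢", right="$"):
    """Breadth-first search over the configurations of a nondeterministic LBA."""
    start = (q0, (left,) + tuple(w) + (right,), 0)   # (state, tape, head position)
    seen, queue = {start}, deque([start])
    while queue:
        q, tape, pos = queue.popleft()
        if q in finals:
            return True
        for p, y, move in delta.get((q, tape[pos]), ()):
            new_pos = pos + 1 if move == "R" else pos - 1
            if not 0 <= new_pos < len(tape):
                continue                 # the head never leaves the portion ¢w$
            cfg = (p, tape[:pos] + (y,) + tape[pos + 1:], new_pos)
            if cfg not in seen:
                seen.add(cfg)
                queue.append(cfg)
    return False


# Hypothetical toy LBA: it accepts a^n iff n is even.
EVEN_A = {
    ("q0", "¢"): {("even", "¢", "R")},
    ("even", "a"): {("odd", "a", "R")},
    ("odd", "a"): {("even", "a", "R")},
    ("even", "$"): {("acc", "$", "L")},
}

print(lba_accepts("aa", EVEN_A, "q0", {"acc"}))
print(lba_accepts("aaa", EVEN_A, "q0", {"acc"}))
```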

It can be shown that if we extend the notion of a linear bounded automaton so
as to allow the automaton to use a number of cells which is limited by a linear function
of n, where n is the length of the input word (instead of being limited by n itself),
then the class of languages accepted by linear bounded automata does not
change.

Now we prove that: 

(i) if L ⊆ Σ* is a type 1 language then it is accepted by a linear bounded automaton,
and

(ii) if a linear bounded automaton accepts a language L ⊆ Σ* then L is a type 1
language in the sense of Definition 4.0.1 on page 171.


These proofs are very similar to the ones relative to the equivalence of type 0 
grammars and Turing Machines we will present in the following chapter. 


THEOREM 4.0.9. [Equivalence Between Type 1 Grammars and Linear
Bounded Automata. Part 1] Let us consider any language R ⊆ Σ*, generated
by a type 1 grammar G = ⟨V_T, V_N, P, S⟩ (thus, we have that the axiom S does not
occur on the right hand side of any production). Then there exists a linear bounded
automaton M such that L(M) = R.

PROOF. Given the grammar G which generates the language R, we construct
the LBA M with two tapes such that L(M) = R, as follows. Initially, for every
w ∈ V_T*, M has on the first tape the word ¢w$. We define L(M) to be the set of
words w such that ¢w$ is accepted by M.

If w = ε then we make M accept ¢w$ (that is, ¢$) iff the production S → ε
occurs in P. Otherwise, if w ≠ ε, M writes on the second tape the initial string
¢S$. Then M simulates a derivation step of w from S by performing the follow-
ing Steps (1), (2), and (3). Let σ denote the current string on the second tape.

Step (1): M chooses in a nondeterministic way a production in P, say α → β, and
an occurrence of α on the second tape such that |σ| - |α| + |β| ≤ |¢w$|. If there is
no such choice, M stops without accepting ¢w$.

Step (2): M rewrites the chosen occurrence of α by β, thereby changing the value
of σ. In order to perform this rewriting, when |α| < |β|, the LBA M should shift to
the right the content of its second tape by applying the so-called shifting-over
technique for Turing Machines [9].

Step (3): M checks whether or not the string σ produced on the second tape is equal
to the string ¢w$ which is kept unchanged on the first tape. If the two strings are
equal, M accepts ¢w$ and stops. If the two strings are not equal, M simulates one
more derivation step of w from S by performing again Steps (1), (2), and (3) above.

Now, since for each word w ∈ R, there exists a sequence of moves of the LBA M
such that M accepts ¢w$, we have that w ∈ R iff w ∈ L(M).


THEOREM 4.0.10. [Equivalence Between Type 1 Grammars and Linear
Bounded Automata. Part 2] For any language A ⊆ Σ* such that there exists a
linear bounded automaton M = ⟨Q, Σ, Γ, q_0, ¢, $, F, δ⟩ which accepts A,
that is, A = L(M), there exists a type 1 grammar G = ⟨Σ, V_N, P, A_1⟩, where
Σ is the set of terminal symbols, V_N is the set of nonterminal symbols, P is the
finite set of productions, and A_1 is the axiom, such that A = L(G), that is, A is the
language generated by G.


PROOF. Given the linear bounded automaton M and a word w ∈ Σ*, we construct
a type 1 grammar G which first makes two copies of w and then simulates the
behaviour of M on one copy. If M accepts w then G generates w, otherwise G does
not generate w. In order to avoid the shortening of the generated sentential form
when the state and the endmarker symbols need to be erased, we have to incorporate
the state and the endmarker symbols into the nonterminals.

We will now give the rules for constructing the set P of productions of the
grammar G. In these productions the pairs of the form [-, -] are symbols of the
nonterminal alphabet V_N.

The productions 0.1, 1.1, N.1, N.2, and N.3, listed below, are necessary for gen-
erating the initial configuration q_0 ¢ a_1 a_2 ... a_N $ (see Definition 5.0.2 on page 184)
when the input word is a_1 a_2 ... a_N, for N ≥ 1. For N = 0, the input word is the
empty string ε and the initial configuration is q_0 ¢ $. As usual, in any configuration
we write the state immediately to the left of the scanned symbol and thus, for N ≥ 0,
the tape head initially scans the symbol ¢.

Here are the productions needed when N = 0. Their label is of the form 0.k. If
the linear bounded automaton M eventually enters a final state and N = 0, then M
accepts a set of words which includes the empty string ε. We need the production:

    0.1  A_1 → [ε, q_0 ¢ $]

and for every p, q ∈ Q, the productions:

    0.2  [ε, p ¢ $] → [ε, ¢ q $]   if δ(p, ¢) = (q, ¢, R)
    0.3  [ε, ¢ p $] → [ε, q ¢ $]   if δ(p, $) = (q, $, L)
    0.4  A_1 → ε   if there exists a final state in any of the configurations occurring in
         the productions 0.1, 0.2, and 0.3.


Here are the productions needed when N = 1. Their label is of the form 1.k. For
every a, b, d ∈ Σ, and q, p ∈ Q, we need the productions:

    1.1  A_1 → [a, q_0 ¢ a $]
    1.2  [a, q ¢ b $] → [a, ¢ p b $]   if δ(q, ¢) = (p, ¢, R)
    1.3  [a, ¢ q b $] → [a, p ¢ d $]   if δ(q, b) = (p, d, L)
    1.4  [a, ¢ q b $] → [a, ¢ d p $]   if δ(q, b) = (p, d, R)
    1.5  [a, ¢ b q $] → [a, ¢ p b $]   if δ(q, $) = (p, $, L)

For every a, b ∈ Σ, and q ∈ F, we need the productions:

    1.6  [a, q ¢ b $] → a
    1.7  [a, ¢ q b $] → a
    1.8  [a, ¢ b q $] → a

The above productions of the form 1.6, 1.7, and 1.8 should be used for generating a
word in Σ* when the linear bounded automaton M enters a final state.


Here are the productions needed when N > 1. Their label is of the form N.k. For
each a, b, d ∈ Σ, and q, p ∈ Q, we need the productions:

    N.1  A_1 → [a, q_0 ¢ a] A_2
    N.2  A_2 → [a, a] A_2
    N.3  A_2 → [a, a $]
    N.4  [a, q ¢ b] → [a, ¢ p b]   if δ(q, ¢) = (p, ¢, R)
    N.5  [a, ¢ q b] → [a, p ¢ d]   if δ(q, b) = (p, d, L)

For every a, b ∈ Σ, q, p ∈ Q such that δ(q, a) = (p, b, R), for every a_k, a_{k+1}, d ∈ Σ,
we need the productions:

    N.6.1  [a_k, ¢ q a] [a_{k+1}, d]    → [a_k, ¢ b] [a_{k+1}, p d]
    N.6.2  [a_k, ¢ q a] [a_{k+1}, d $]  → [a_k, ¢ b] [a_{k+1}, p d $]
    N.6.3  [a_k, q a] [a_{k+1}, d]      → [a_k, b] [a_{k+1}, p d]
    N.6.4  [a_k, q a] [a_{k+1}, d $]    → [a_k, b] [a_{k+1}, p d $]

For every a, b ∈ Σ, q, p ∈ Q, such that δ(q, a) = (p, b, L), for every a_k, a_{k+1}, d ∈ Σ,
we need the productions:

    N.7.1  [a_k, ¢ d] [a_{k+1}, q a]    → [a_k, ¢ p d] [a_{k+1}, b]
    N.7.2  [a_k, ¢ d] [a_{k+1}, q a $]  → [a_k, ¢ p d] [a_{k+1}, b $]
    N.7.3  [a_k, d] [a_{k+1}, q a]      → [a_k, p d] [a_{k+1}, b]
    N.7.4  [a_k, d] [a_{k+1}, q a $]    → [a_k, p d] [a_{k+1}, b $]

For every a, b, d ∈ Σ, and q, p ∈ Q, we need the productions:

    N.8  [a, q b $] → [a, d p $]   if δ(q, b) = (p, d, R)
    N.9  [a, b q $] → [a, p b $]   if δ(q, $) = (p, $, L)

For every a, b, d ∈ Σ, and q ∈ F, we need the productions:

    N.10  [a, q ¢ b] → a
    N.11  [a, ¢ q b] → a
    N.12  [a, q b $] → a
    N.13  [a, b q $] → a
    N.14  [a, q b] → a
    N.15  [a, d] b → a b
    N.16  [a, ¢ d] b → a b
    N.17  b [a, d] → b a
    N.18  b [a, d $] → b a

The productions of the form N.10-N.18 should be used for generating a word in Σ*
when the linear bounded automaton M enters a final state.

We will not prove that these productions simulate the behaviour of the LBA M,
that is, for any w ∈ Σ*, w is generated by G iff w ∈ L(M). We simply make the
following two observations:

(i) the ‘first components’ of the nonterminals [-, -] are never touched by the pro-
ductions, so that the given word w is kept unchanged, and

(ii) a nonterminal [-, -] is never turned into a terminal symbol unless a final state q
is encountered first. □


We have the following facts which we state without proof. 

Every context-free language is accepted by a deterministic linear bounded au- 
tomaton. 

The problem of determining whether or not a given context-sensitive grammar
G = ⟨V_T, V_N, P, S⟩ without the production S → ε, generates the language Σ* is
trivial. The answer is ‘no’, because the empty string ε does not belong to L(G).

However, the problem of determining whether or not a given context-sensitive
grammar G without the production S → ε, generates the language Σ^+ is undecid-
able.


FACT 4.0.11. [Recursively Enumerable Languages Are Generated by
Homomorphisms From Context-Sensitive Languages] Given any r.e. set A,
which is a subset of Σ*, there exists a context-sensitive language L such that ε ∉ L,
and a homomorphism h from Σ to Σ* such that A = h(L). This homomorphism is
not, in general, an ε-free homomorphism [9, page 230].


The class of context-sensitive languages is a Boolean Algebra. Indeed, it is closed 
under: (i) union, (ii) intersection, and (iii) complementation. 

It is open whether or not every context-sensitive language is a deterministic
context-sensitive language, that is, it is accepted by a deterministic linear bounded
automaton [9, pages 229-230].


4.1. Recursiveness of Context-Sensitive Languages 


In this section we prove that the class of the context-sensitive languages is a proper
subclass of the class of the recursive languages. Without loss of generality, let us as-
sume that the alphabet Σ of the languages we consider, is the binary alphabet {0, 1}.


LEMMA 4.1.1. [Context-Sensitive Languages are Recursive Languages] 
Every context-sensitive language is a recursive language. 


PROOF. It follows from the fact that membership for context-sensitive languages is
decidable (see Theorem 4.0.5 on page 173). We can also reason directly as follows.
Given a context-sensitive grammar G = ⟨V_T, V_N, P, S⟩, we need to show that there
exists an algorithm which always terminates such that given any word w ∈ V_T*, tells
us whether or not w ∈ L(G). It is enough to construct a directed graph whose
nodes are labeled by strings s in (V_T ∪ V_N)* such that |s| ≤ |w|. Obviously, there is
a finite number of those strings. In this graph there is an arc from the node labeled
by the string s_1 to the node labeled by the string s_2 iff we can derive s_2 from s_1
in one derivation step, by application of a single production of G. The presence
of an arc between any two nodes can be determined in finite time because there is
only a finite number of productions in P and the string s_1 is of finite length. We
can then determine whether or not there is a path from the node labeled by S to
the node labeled by w, by applying a reachability algorithm (see, for instance, [13,
page 45]).
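
The graph-based argument of the above proof can be sketched in Python as follows. In this sketch, which is only illustrative, the graph is not built in advance: its nodes are generated on demand by a breadth-first search which also records, for every node, the node from which it has been reached, so that a derivation of w can be returned when it exists. The function name find_derivation, the representation of productions as pairs of strings, and the sample grammar are assumptions of ours.

```python
from collections import deque


def find_derivation(w, productions, axiom="S"):
    """Return a derivation [axiom, ..., w] using only strings s with |s| <= |w|,
    or None if there is no path from the axiom to w in the graph."""
    bound = max(len(w), 1)        # the axiom itself has length 1
    parent = {axiom: None}
    queue = deque([axiom])
    while queue:
        s1 = queue.popleft()
        if s1 == w:
            path = []
            while s1 is not None:
                path.append(s1)
                s1 = parent[s1]
            return path[::-1]
        for lhs, rhs in productions:
            pos = s1.find(lhs)
            while pos != -1:      # one arc for every occurrence of lhs in s1
                s2 = s1[:pos] + rhs + s1[pos + len(lhs):]
                if len(s2) <= bound and s2 not in parent:
                    parent[s2] = s1
                    queue.append(s2)
                pos = s1.find(lhs, pos + 1)
    return None


# Sample grammar (an assumption): it generates {a^n b^n c^n | n >= 1}.
GRAMMAR = [("S", "aSBC"), ("S", "aBC"), ("CB", "BC"), ("aB", "ab"),
           ("bB", "bb"), ("bC", "bc"), ("cC", "cc")]

print(find_derivation("abc", GRAMMAR))
print(find_derivation("abac", GRAMMAR))
```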


Let us introduce the following concept. 


DEFINITION 4.1.2. [Enumeration of Turing Machines or Languages] An
enumeration of Turing Machines (or languages, subsets of Σ*) is an algorithm E
(that is, a Turing Machine or a computable function [14, 18]) which given a natural
number n, always terminates and returns a Turing Machine M_n (or a language L_n,
subset of Σ*). By abuse of language, also the sequence produced by the algorithm E
for the input values: 0, 1, ..., is said to be an enumeration.

Let |N| be the cardinality of the set N of natural numbers. In the literature |N|
is also denoted by ℵ_0 (pronounced aleph-zero).


LEMMA 4.1.3. The cardinality of the set of all Turing Machines which always
halt is |N|.


PROOF. This lemma is an easy consequence of the Bernstein Theorem (see The-
orem 7.9.2 on page 235). Indeed,

(i) |{T | T is a Turing Machine which always halts}|
    ≤ |{T | T is a Turing Machine}| = |N|, and

(ii) for any n ∈ N, we can construct a Turing Machine T_n which always halts and
returns n. □


As a consequence of the following lemma we have that the set of all Turing 
Machines which always halt is not recursively enumerable. 


LEMMA 4.1.4. For every given enumeration ⟨M_0, M_1, ...⟩ of Turing Machines
each of which always halts and recognizes a recursive language subset of {0,1}*,
there exists a Turing Machine M which always halts and recognizes a recursive
language subset of {0,1}*, such that it is not in the enumeration.


PROOF. We stipulate that every word w in {0,1}*, when used as a subscript of a
Turing Machine of an enumeration as we will do below, denotes the natural number n
such that n+1 has the binary expansion 1w. Thus, for instance, ε denotes 0, and 10
denotes 5 (indeed, the binary expansion of 6 is 110). We leave it to the reader to
show that this denotation provides a bijection between {0,1}* and N.
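
The denotation of natural numbers by words in {0,1}* can be checked on small values by the following Python sketch, where the function names word_to_nat and nat_to_word are hypothetical.

```python
def word_to_nat(w):
    """The number n such that the binary expansion of n+1 is 1w."""
    return int("1" + w, 2) - 1


def nat_to_word(n):
    """The inverse mapping: drop the leading 1 of the binary expansion of n+1."""
    return bin(n + 1)[3:]          # bin(...) yields '0b1...'; skip '0b1'


print(word_to_nat(""), word_to_nat("10"))
print([nat_to_word(n) for n in range(7)])
```

The second line of output lists the words denoting 0, 1, ..., 6, that is, the words of {0,1}* by increasing length and, for equal length, in lexicographic order.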

Given the enumeration ⟨M_0, M_1, ...⟩, let us consider the language L ⊆ {0,1}*
defined as follows:

    L = {w | M_w does not accept w}.    (α)

Now, L is recursive because given any word w ∈ {0,1}*, we can compute the num-
ber n which w denotes when w is used as a subscript of a Turing Machine. Then,
given n, from the enumeration we get a Turing Machine M_n which always halts.
Therefore, it is decidable whether or not w ∈ L by checking whether or not M_w
accepts w.

If, by contradiction, we assume that all Turing Machines which always terminate are
in the enumeration, then since L is recursive, there exists in the enumeration also
the Turing Machine, say M_z, which always halts and accepts L, that is,

    ∀w ∈ {0,1}*, w ∈ L iff M_z accepts w.    (β)

In particular, for w = z, from (β) we get that:

    z ∈ L iff M_z accepts z.    (β_z)

Now, by (α) we have that:

    z ∈ L iff M_z does not accept z.    (γ)

We have that the sentences (β_z) and (γ) are contradictory. This completes the proof
of the lemma. □
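
The diagonal construction used in the above proof can be sketched in Python as follows. We assume, for illustration only, that an enumeration is a Python function which, given n, returns a procedure which always halts and tells us whether or not a word is accepted; the names diagonal and toy_enumeration are hypothetical, and the toy enumeration is not meant to list all machines.

```python
def word_to_nat(w):
    """The number n such that the binary expansion of n+1 is 1w."""
    return int("1" + w, 2) - 1


def nat_to_word(n):
    return bin(n + 1)[3:]


def diagonal(enumeration):
    """A decider for L = {w | M_w does not accept w}."""
    def decide(w):
        m_w = enumeration(word_to_nat(w))   # the machine denoted by w
        return not m_w(w)                   # halts, because m_w always halts
    return decide


def toy_enumeration(n):
    """Hypothetical M_n: it accepts the words whose length is a multiple of n+1."""
    return lambda w: len(w) % (n + 1) == 0


d = diagonal(toy_enumeration)
# The decider d differs from every M_n on the word which denotes n.
print(all(d(nat_to_word(n)) != toy_enumeration(n)(nat_to_word(n))
          for n in range(100)))
```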

THEOREM 4.1.5. [Context-Sensitive Languages are a Proper Subset of
the Recursive Languages] There exists a recursive language which is not context-
sensitive.


PROOF. It is enough: (i) to exhibit an enumeration ⟨L_0, L_1, ...⟩ of all context-
sensitive languages (no context-sensitive language should be omitted in that enu-
meration), and

(ii) to construct for every context-sensitive language L_i in the enumeration, a Turing
Machine which always halts and accepts L_i.

From (i) we have that there is an enumeration of all context-sensitive languages. 
Then, by (ii) we have that there exists an enumeration of Turing Machines, each 
of which always halts. Then, by Lemma 4.1.4 there exists a Turing Machine which 
always halts and it is not in the enumeration. This means that there exists a 
recursive language, say L, which is accepted by a Turing Machine which always 
halts and it is not in the enumeration. Thus, L is a recursive language which is not 
a context-sensitive language. 


Proof of (i). Any context-sensitive grammar can be encoded by a natural number
whose binary expansion is obtained by using the following mapping, where 10^n
stands for 1 followed by n 0’s, for any n ≥ 1:


0 — 10! 
1 => 10? 
10° 
104 
10° 
10° 
10° 
108 
10° 


—- ~ WWwn- 


FELTA 


A 


For instance, the grammar ⟨{0, 1}, {S, A}, {S → 0S1, S → A10, A1 → 01}, S⟩ is
encoded by a number whose binary expansion is:


10" ros 1010" 10” 10 00E ae r e i aaa 0 10 10° 
Re ake a SA es g i a A a pa SB 7 


Now if we assume that: 

(i.1) every natural number which encodes a context-sensitive grammar, denotes the 
corresponding context-sensitive language, and 

(i.2) every natural number which is not the encoding of a context-sensitive grammar, 
denotes the empty language (which is a context-sensitive language), 

we have that the sequence ⟨0, 1, 2, ...⟩ denotes an enumeration ⟨L_0, L_1, L_2, ...⟩ (with
repetitions) of all context-sensitive languages.

Note that the test we should make at Point (i.2) for checking whether or not a
natural number is the encoding of a context-sensitive grammar, can be done by a
Turing Machine which terminates for every given natural number.


Proof of (ii). This Point (ii) is Lemma 4.1.1 on page 179. Now we give a different
proof. Let us consider the context-sensitive language L_n which is generated by the
context-sensitive grammar which is encoded by the natural number n. Then the
Turing Machine M_n, which always halts and accepts L_n, is the algorithm which by
using n as a program, tests whether or not a given input word w is in L_n. The
Turing Machine M_n works as follows. M_n starts from the axiom S and generates
all sentential forms derivable from S by exploring in a breadth-first manner the tree
of all possible derivations from S. We construct M_n so that no sentential form is
generated by M_n, unless all shorter sentential forms have already been generated.
M_n can decide whether or not w is in L_n by computing the sentential forms whose
length is not greater than the length of w. (Note that the set of all sentential forms
whose length is not greater than the length of w is a finite set.) Thus, M_n always
halts and it accepts L_n. □


Note that in the proof of this theorem we have constructed an enumeration of 
all context-sensitive languages by providing an enumeration of all context-sensitive 
grammars. Indeed, since a context-sensitive language may be an infinite set of words, 
we need a finite object to denote it and we have chosen that finite object to be the 
grammar which generates the language. 


CHAPTER 5 


Turing Machines and Type 0 Grammars 


In this chapter we establish the equivalence between the class of Turing computable 
languages and the class of type 0 grammars. Before presenting this result, we recall 
some basic notions about the Turing Machines. These machines were introduced by 
the English mathematician Alan Turing in 1936 for formalizing the intuitive notion 
of an algorithm [22]. 

Informally, a Turing Machine M consists of:
(i) a finite automaton FA, also called the control,
(ii) a one-way infinite tape, which is an infinite sequence {c_i | i ∈ N, i > 0} of cells
c_i's, and
(iii) a tape head which at any given time is on a single cell. When the tape head is
on the cell c_i we will also say that the tape head scans the cell c_i.

The cell which the tape head scans, is called the scanned cell and it can be read
and written by the tape head. Each cell contains exactly one of the symbols of the
tape alphabet Γ. The states of the automaton FA are also called internal states, or
simply states, of the Turing Machine M.

We say that the Turing Machine M is in state q, or q is the current state of M,
if the automaton FA is in state q, or q is the current state of FA, respectively.

We assume a left-to-right orientation of the tape by stipulating that for any
i > 0, the cell c_{i+1} is immediately to the right of the cell c_i.


A Turing Machine M behaves as follows. It starts with a tape containing in
its leftmost n (≥ 0) cells c_1 c_2 ... c_n a sequence of n input symbols from the input
alphabet Σ, while all other cells contain the symbol B, called blank, belonging to Γ.
We assume that: Σ ⊆ Γ - {B}. If n = 0 then, initially, the blank symbol B is
in every cell of the tape. The Turing Machine M starts with its tape head on the
leftmost cell, that is, c_1, and its control, that is, the automaton FA in its initial
state q_0.

An instruction (or a quintuple) of the Turing Machine is a structure of the form:

    q_i, X_h ↦ q_j, X_k, m

where: (i) q_i ∈ Q is the current state of the automaton FA,

(ii) X_h ∈ Γ is the scanned symbol, that is, the symbol of the scanned cell that is
read by the tape head,

(iii) q_j ∈ Q is the new state of the automaton FA,

(iv) X_k ∈ Γ is the printed symbol, that is, the non-blank symbol of Γ which replaces
X_h on the scanned cell when the instruction is executed, and

(v) m ∈ {L, R} is a value which denotes that, after the execution of the instruction,
the tape head moves either one cell to the left, if m = L, or one cell to the right,
if m = R. Initially and when the tape head of a Turing Machine scans the leftmost
cell c_1 of the tape, m must be R.


Given a Turing Machine M, if no two instructions of that machine have the
same current state q_i and scanned symbol X_h, we say that the Turing Machine M
is deterministic.
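
The condition for determinism can be checked mechanically on a finite set of instructions. In the following minimal Python sketch we assume that a quintuple is represented by a tuple (current state, scanned symbol, new state, printed symbol, move); this representation and the function name is_deterministic are assumptions of ours.

```python
from collections import Counter


def is_deterministic(quintuples):
    """True iff no two distinct quintuples have the same current state
    and the same scanned symbol."""
    counts = Counter((q, x) for (q, x, _p, _y, _m) in set(quintuples))
    return all(c == 1 for c in counts.values())


print(is_deterministic([("q0", "a", "q1", "b", "R"), ("q0", "b", "q1", "b", "R")]))
print(is_deterministic([("q0", "a", "q1", "b", "R"), ("q0", "a", "q2", "a", "L")]))
```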


Since it is assumed that the printed symbol X_k is not the blank symbol B, we
have that if the tape head scans a cell with a blank symbol then: (i) every symbol
to the left of that cell is not a blank symbol, and (ii) every symbol to the right of
that cell is a blank symbol.

Here is the formal definition of a Turing Machine. 


DEFINITION 5.0.1. [Turing Machine] A Turing Machine (or a deterministic
Turing Machine) is a septuple of the form ⟨Q, Σ, Γ, q_0, B, F, δ⟩, where:
- Q is the set of states,
- Σ is the input alphabet,
- Γ is the tape alphabet,
- q_0 in Q is the initial state,
- B in Γ is the blank symbol,
- F ⊆ Q is the set of final states, and
- δ is a partial function from Q × Γ to Q × (Γ - {B}) × {L, R}, called the transition
function, which defines the set of instructions or quintuples of the Turing Machine.
We assume that Q and Γ are disjoint sets and Σ ⊆ Γ - {B}.


We may extend the definition of a Turing Machine by allowing the transition
function δ to be a partial function from the set Q × Γ to the set of the subsets of
Q × (Γ - {B}) × {L, R} (not to the set Q × (Γ - {B}) × {L, R}). In that case it
is possible that two quintuples of δ have the same first two components and if this
is the case, we say that the Turing Machine is nondeterministic. Unless otherwise
specified, the Turing Machines we consider are assumed to be deterministic.


Let us consider a Turing Machine whose leftmost part of the tape consists of the
cells:

    c_1 c_2 ... c_{h-1} c_h ... c_k

where c_k, with 1 ≤ k, is the rightmost cell with a non-blank symbol, and c_h, with
1 ≤ h ≤ k+1, is the cell scanned by the tape head.


DEFINITION 5.0.2. [Configuration of a Turing Machine] A configuration of
a Turing Machine M whose tape head scans the cell c_h for some h ≥ 1, such that
the cells containing a non-blank symbol in Γ are c_1 ... c_k, for some k ≥ 0, with
1 ≤ h ≤ k+1, is the triple α_1 q α_2, where:
- α_1 is the (possibly empty) word in (Γ - {B})^{h-1} written in the cells c_1 c_2 ... c_{h-1},
one symbol per cell from left to right,
- q is the current state of the Turing Machine M, and
- if the tape head scans a cell with a non-blank symbol, that is, 1 ≤ h ≤ k, then α_2
is the non-empty word of Γ^{k-h+1} written in the cells c_h ... c_k, one symbol per cell
from left to right, else if the tape head scans a cell with the blank symbol B, then
α_2 is the sequence of one B only, that is, h = k+1.

For each configuration γ = α_1 q α_2, we assume that: (i) the tape head scans the
leftmost symbol of α_2, and (ii) we say that q is the state in the configuration γ.

[Figure 5.0.1: a Turing Machine with the finite automaton FA in state q (states: Q,
initial state: q_0, final states: F), Σ = {a, b, c, d}, Γ = {a, b, c, d, B}, and a tape head
which moves to the left and to the right.]

FIGURE 5.0.1. A Turing Machine in the configuration α_1 q α_2, that
is, b b a q a b d. The head scans the cell c_4 and reads the symbol a.


In Figure 5.0.1 we have depicted a Turing Machine whose configuration is α_1 q α_2.


If the word w = a_1 a_2 ... a_n is initially written, one symbol per cell, on the n
leftmost cells of the tape of a Turing Machine M and all other cells contain B, then
the initial configuration of M is q_0 w, that is, the configuration where: (i) α_1 is
the empty sequence ε, (ii) the state of M is the initial state q_0, and (iii) α_2 = w.
The word w of the initial configuration is said to be the input word for the Turing
Machine M.


DEFINITION 5.0.3. [Tape Configuration of a Turing Machine] Given a Tur-
ing Machine whose configuration is α_1 q α_2, we say that its tape configuration is the
string α_1 α_2 in Γ*.


Now we give the definition of a move of a Turing Machine. By this notion we 
characterize the execution of an instruction as a pair of configurations, that is, (i) the 
configuration ‘before the execution’ of the instruction, and (ii) the configuration 
‘after the execution’ of the instruction. 


DEFINITION 5.0.4. [Move (or Transition) of a Turing Machine] Given a
Turing Machine M, its move relation (or transition relation), denoted →_M, is a
subset of C_M × C_M, where C_M is the set of configurations of M, such that for any
states p, q ∈ Q, for any tape symbols X_1, ..., X_{i-2}, X_{i-1}, X_i, X_{i+1}, ..., X_n, Y ∈ Γ,
either:

1. if δ(q, X_i) = (p, Y, L) and X_1 ... X_{i-2} X_{i-1} ≠ ε then

    X_1 ... X_{i-2} X_{i-1} q X_i X_{i+1} ... X_n  →_M  X_1 ... X_{i-2} p X_{i-1} Y X_{i+1} ... X_n

or

2. if δ(q, X_i) = (p, Y, R) then

    X_1 ... X_{i-2} X_{i-1} q X_i X_{i+1} ... X_n  →_M  X_1 ... X_{i-2} X_{i-1} Y p X_{i+1} ... X_n


In Case (1) of this definition we have added the condition X_1 ... X_{i-2} X_{i-1} ≠ ε
because the tape head has to move to the left, and thus, ‘before the move’, it should
not scan the leftmost cell of the tape.

When the transition function δ of a Turing Machine M is applied to the current
state and the scanned symbol, we have that the current configuration γ_1 is changed
into a new configuration γ_2. In this case we say that M makes the move from γ_1
to γ_2 and we write γ_1 →_M γ_2.

As usual, the reflexive and transitive closure of the relation →_M is denoted
by →*_M.
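
The move relation of a deterministic Turing Machine can be sketched in Python as follows. The sketch makes some assumptions of ours: a configuration α_1 q α_2 is represented by the triple (alpha1, q, alpha2) of strings, every tape symbol is a single character, the blank symbol is the character B, and the transition function is a dictionary from pairs (state, symbol) to triples (state, symbol, move). When the head moves to the right of the rightmost non-blank cell, the new α_2 is the sequence of one B only, in accordance with the definition of a configuration.

```python
BLANK = "B"


def move(config, delta):
    """Return the configuration reached by one move, or None if no move exists."""
    alpha1, q, alpha2 = config
    x = alpha2[0]                          # the scanned symbol
    if (q, x) not in delta:
        return None                        # no quintuple applies
    p, y, m = delta[(q, x)]
    rest = alpha2[1:]
    if m == "L":
        if alpha1 == "":
            return None                    # Case (1) requires a non-empty left part
        return (alpha1[:-1], p, alpha1[-1] + y + rest)
    # Case (2): the head moves to the right.
    return (alpha1 + y, p, rest if rest else BLANK)


# The configuration b b a q a b d with two hypothetical instructions.
print(move(("bba", "q", "abd"), {("q", "a"): ("p", "c", "R")}))
print(move(("bba", "q", "abd"), {("q", "a"): ("p", "c", "L")}))
```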

The following definition introduces various concepts about the halting behaviour of 
a Turing Machine. They will be useful in the sequel. 


DEFINITION 5.0.5. [Final States and Halting Behaviour of a Turing Ma-
chine] (i) We say that a Turing Machine M enters a final state when making the
move γ_1 →_M γ_2 iff the state in the configuration γ_2 is a final state.

(ii) We say that a Turing Machine M stops (or halts) in a configuration α_1 q α_2 iff
no quintuple of M is of the form: q, X ↦ q_j, X_k, m, where X is the leftmost
symbol of α_2, for some state q_j ∈ Q, symbol X_k ∈ Γ, and value m ∈ {L, R}. Thus,
in this case no configuration γ exists such that α_1 q α_2 →_M γ.

(iii) We say that a Turing Machine M stops (or halts) in a state q iff no quintuple
of M is of the form: q, X_h ↦ q_j, X_k, m for some state q_j ∈ Q, symbols X_h, X_k ∈ Γ,
and value m ∈ {L, R}.

(iv) We say that a Turing Machine M stops (or halts) on the input w iff for the
initial configuration q_0 w there exists a configuration γ such that: (i) q_0 w →*_M γ,
and (ii) M stops in the configuration γ.

(v) We say that a Turing Machine M stops (or halts) iff for every initial configuration
q_0 w there exists a configuration γ such that: (i) q_0 w →*_M γ, and (ii) M stops in the
configuration γ.


In Case (v), instead of saying: ‘the Turing Machine M stops’ (or halts), we also
say: ‘the Turing Machine M always stops’ (or always halts, respectively). Indeed, we
will do so when we want to stress the fact that M stops for all initial configurations
of the form q_0 w, where q_0 is the initial state and w is an input word (in particular,
we have used this terminology in Lemma 4.1.4 on page 180).


DEFINITION 5.0.6. [Language Accepted by a Turing Machine. Equiva-
lence Between Turing Machines] Let us consider a deterministic Turing Machine
M with initial state q_0, and an input word w ∈ Σ* for M.

(1) We say that M answers ‘yes’ for w (or M accepts w) iff (1.1) there exist
q ∈ F, α_1 ∈ Γ*, and α_2 ∈ Γ^+ such that q_0 w →*_M α_1 q α_2, and (1.2) M stops in the
configuration α_1 q α_2 (that is, M stops in a final state, not necessarily the first final
state which is entered by M).

(2) We say that M answers ‘no’ for w (or M rejects w) iff (2.1) for all configura-
tions γ such that q_0 w →*_M γ, the state in γ is not a final state, and (2.2) there
exists a configuration γ such that q_0 w →*_M γ and M stops in γ (that is, M never
enters a final state and there is a state in which M stops).

(3) The set {w | w ∈ Σ* and q_0 w →*_M α_1 q α_2 for some q ∈ F, α_1 ∈ Γ*, and α_2 ∈ Γ^+}
which is a subset of Σ*, is said to be the language accepted by M and it is denoted
by L(M). Every word w in L(M) is said to be a word accepted by M, and for all
w ∈ L(M), M accepts w. A language accepted by a Turing Machine is said to be
Turing computable.

(4) Two Turing Machines M_1 and M_2 are said to be equivalent iff L(M_1) = L(M_2).

When the input word w is understood from the context, we will simply say: M 
answers ‘yes’ (or ‘no’), instead of saying: M answers ‘yes’ (or ‘no’) for the word w. 

Note that in other textbooks, when introducing the concepts of Definition 5.0.6 
above, the authors use the expressions ‘recognizes’, ‘recognized’, and ‘does not rec- 
ognize’, instead of ‘accepts’, ‘accepted’, and ‘rejects’, respectively. 


REMARK 5.0.7. [Halting Hypothesis] Unless otherwise specified, we will assume
the following hypothesis, called the Halting Hypothesis:
for all Turing Machines M, for all initial configurations $q_0 w$, and for all configurations
$\gamma$, if $q_0 w \rightarrow^*_M \gamma$ and the state in $\gamma$ is final then no configuration $\gamma'$ exists such
that $\gamma \rightarrow_M \gamma'$ (that is, the first time M enters a final state, M stops in that state).
Thus, by assuming the Halting Hypothesis, we will consider only Turing Machines
which stop whenever they are in a final state.


It is easy to see that this Halting Hypothesis can always be assumed without chang- 
ing the notions introduced in the above Definition 5.0.6. In particular, for any given 
Turing Machine M which accepts the language L, there exists an equivalent Turing 
Machine which complies with the Halting Hypothesis. 

As in the case of finite automata, we say that the notion of acceptance of a 
word w (or a language L) by a Turing Machine M is by final state, because the 
word w (or every word of the language L) is accepted by the Turing Machine M, 
if M is in a final state or ever enters a final state, as specified by Definition 5.0.6. 

The notion of acceptance of a word, or a language, by a nondeterministic Turing 
Machine is identical to that of a deterministic Turing Machine. 


DEFINITION 5.0.8. [Word and Language Accepted by a Nondeterministic
Turing Machine] A word w is accepted by a nondeterministic Turing Machine
M with initial state $q_0$, iff there exists a configuration $\gamma$ such that $q_0 w \rightarrow^*_M \gamma$ and
the state of $\gamma$ is a final state. The language accepted by a nondeterministic Turing
Machine M is the set of words accepted by M.


Sometimes in the literature, one refers to this notion of acceptance by saying that
every nondeterministic Turing Machine has angelic nondeterminism. The qualification
‘angelic’ is due to the fact that a word w is accepted by a nondeterministic
Turing Machine M if there exists a sequence of moves (and not ‘for all sequences of
moves’) such that M makes a sequence of moves from the initial configuration $q_0 w$
to a configuration with a final state.

Similarly to what happens for finite automata (see page 29), Turing Machines can
be presented by giving their input alphabet and their transition functions, assuming
that they are total (see also Remark 5.0.10 on page 189). Indeed, from the transition
function of a Turing Machine M one can derive also its set of states and its tape
alphabet. The transition function $\delta$ of a Turing Machine can be represented as a
multigraph, by representing each quintuple of $\delta$ of the form:

$q_i, X_h \mapsto q_j, X_k, m$

as an arc from node $q_i$ to node $q_j$ labeled by ‘$X_h$ ($X_k$, m)’ as follows:

[Diagram: an arc from node $q_i$ to node $q_j$ labeled by ‘$X_h$ ($X_k$, m)’.]


In Figure 5.0.2 below we present a Turing Machine M which accepts the language
$\{a^n b^n \mid n \geq 0\}$. Note that this language can also be accepted by a deterministic
pushdown automaton, but it cannot be accepted by any finite automaton, because
it is not a regular language.


FIGURE 5.0.2. The transition function of a Turing Machine M which
accepts the language $\{a^n b^n \mid n \geq 0\}$. The input alphabet is $\{a, b\}$. The
initial state is $q_0$. The unique final state is $q_5$. If the machine M halts
in a state which is not final, then the input word is not accepted. The
arc labeled by ‘B (#, L)’ from state $q_1$ to state $q_2$ is followed only on
the first sweep from left to right, and the arc labeled by ‘# (#, L)’
from state $q_1$ to state $q_2$ is followed in all other sweeps from left to
right.


Note also that deterministic pushdown automata are devices which are computation- 
ally ‘less powerful’ than Turing Machines, because deterministic pushdown automata 
accept deterministic context-free languages while, as we will see below, Turing Ma- 
chines accept type 0 languages. 

The Turing Machine M whose transition function is depicted in Figure 5.0.2
accepts the language $\{a^n b^n \mid n \geq 0\}$ by implementing the following algorithm.

ALGORITHM 5.0.9. Acceptance of the language $\{a^n b^n \mid n \geq 0\}$ by the Turing
Machine M of Figure 5.0.2. The input alphabet is $\Sigma = \{a, b\}$. The tape alphabet is
$\Gamma = \{a, b, B, \#\}$.


Initially the tape head of M scans the leftmost cell;
α: if the tape head reads the symbol B or # then M accepts the input
      word (which is the empty word if the symbol read is B)
   else begin
      - M performs a sweep to the right until # or B is found
        (B is found only when the first left-to-right sweep is performed)
        and during that sweep it changes the leftmost a into #;
        if this change is not possible then M rejects the input word;
      - M performs a sweep to the left until # is found and
        during that sweep it changes the rightmost b into #;
        if this change is not possible then M rejects the input word;
      - go to α
   end


If during a left-to-right or right-to-left sweep of the tape head going from a symbol 
# to another symbol #, no change of character can be made according to Algo- 
rithm 5.0.9, then the input word should be rejected. We leave it to the reader to 
prove this fact. This proof is based on the property that if no change of character 
can be made, then the number of a’s is different from the number of b’s. 
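The sweeps of Algorithm 5.0.9 can be sketched directly on a list of tape cells. This is only a tape-level sketch and not the transition function of Figure 5.0.2: the function name `accepts_anbn` and the use of a Python list for the tape are our own assumptions, and we do not overwrite the blank symbol B at the end of the first sweep.

```python
def accepts_anbn(word):
    """Tape-level sketch of Algorithm 5.0.9 for {a^n b^n | n >= 0}."""
    tape = list(word) + ["B"]  # B is the first blank cell after the input
    head = 0
    while True:
        # Label alpha: reading B or # means that no unmatched symbol is left.
        if tape[head] in ("B", "#"):
            return True
        # Sweep to the right: the leftmost unmarked symbol must be an a.
        if tape[head] != "a":
            return False
        tape[head] = "#"
        head += 1
        while tape[head] not in ("B", "#"):
            head += 1
        # Sweep to the left: the rightmost unmarked symbol must be a b.
        head -= 1
        if tape[head] != "b":
            return False
        tape[head] = "#"
        head -= 1
        while tape[head] != "#":
            head -= 1
        head += 1  # the leftmost unmarked cell, if any; then go to alpha
```

For instance, `accepts_anbn("aabb")` holds, while `accepts_anbn("aab")` and `accepts_anbn("abab")` do not.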

As a consequence of our definitions, we have that a language L is accepted by
some Turing Machine iff there exists a Turing Machine M such that for all words
$w \in L$, starting from the initial configuration $q_0 w$, the state $q_0$ is a final state or the
state of the Turing Machine M will eventually be a final state. For words which are
not in L, the Turing Machine M may halt without ever being in a final state or it
may run forever without ever entering a final state.


REMARK 5.0.10. Without loss of generality, we may assume that the transition
function $\delta$ of any given Turing Machine is a total function by adding a sink state to
the set Q of states as we have done in the case of finite automata (see Section 2.1).
We stipulate that: (i) the sink state is not final, and (ii) for every tape symbol which
is in the scanned cell, the transition from the sink state takes the Turing Machine
back to the sink state.


We have the following result which we state without proof (see [14]). 


THEOREM 5.0.11. [Equivalence of Deterministic and Nondeterministic 
Turing Machines] For any nondeterministic Turing Machine M there exists a 
deterministic Turing Machine which accepts the language L(M). 


There are other kinds of Turing Machines which have been described in the
literature and one may want to consider. In particular, (i) one may allow the tape
of a Turing Machine to be two-way infinite, instead of one-way infinite, and (ii) one
may allow k (>1) tapes, instead of one tape only.

[Figure: a read-only input tape (the input tape head moves to the left and to the
right) and a working tape (the working tape head moves to the left and to the right).]

FIGURE 5.0.3. An off-line Turing Machine (the lower tape may equivalently
be two-way infinite or one-way infinite).

We will not give here the formal definitions of these kinds of Turing Machines. 
It will suffice to say that if a Turing Machine M has k (> 1) tapes then: (i) each 
move of the machine M depends on the sequence of k symbols which are read by 
the k heads, (ii) before moving to the left or to the right, each tape head prints 
a symbol on its tape, and (iii) after printing a symbol on its tape, each tape head 
moves either to the left or to the right, independently of the moves of the other tape 
heads. 

The following theorem tells us that these kinds of Turing Machines have no 
greater computational power with respect to the basic kind of Turing Machines 
which we have introduced in Definition 5.0.1. 


THEOREM 5.0.12. [Equivalence of Turing Machines with 1 One-Way In- 
finite Tape and k (>1) Two-Way Infinite Tapes] Given any nondeterministic 
Turing Machine M with k (>1) two-way infinite tapes, there exists a deterministic 
Turing Machine (with one one-way infinite tape) which accepts the language L(M). 


Now let us introduce the notion of an off-line Turing Machine (see also Fig- 
ure 5.0.3 on page 190). It is a Turing Machine with two tapes (and, thus, the moves 
of the machine are done by reading the symbols on the two tapes, and by changing 
the positions of the two tape heads) with the limitation that one of the two tapes, 
called the input tape, is a tape which contains the input word between the two spe- 
cial endmarker symbols ¢ and $. The input tape can be read, but not modified. 
Moreover, it is not allowed to use the input tape outside the cells where the input 
is written. The other tape of an off-line Turing Machine will be referred to as the 
working tape, or the standard tape. 


5.1. Equivalence Between Turing Machines and Type 0 Languages 


Now we can state and prove the equivalence between Turing Machines and type 0 
languages. 


THEOREM 5.1.1. [Equivalence Between Type 0 Grammars and Turing
Machines. Part 1] For any language $R \subseteq \Sigma^*$, if R is generated by the type 0
grammar $G = \langle \Sigma, V_N, P, S \rangle$, where $\Sigma$ is the set of terminal symbols, $V_N$ is the set
of nonterminal symbols, P is the finite set of productions, and S is the axiom, then
there exists a Turing Machine M such that L(M) = R.


PROOF. Given the grammar G which generates the language R, we construct a
nondeterministic Turing Machine M with two tapes as follows. Initially, on the first
tape there is the word w to be accepted iff $w \in R$, and on the second tape there is
the sentential form consisting of the axiom S only. Then M simulates a derivation
step of w from S by performing the following Steps (1), (2), and (3). Step (1): M
chooses in a nondeterministic way a production of the grammar G, say $\alpha \rightarrow \beta$,
and an occurrence of $\alpha$ on the second tape. Step (2): M rewrites that occurrence
of $\alpha$ by $\beta$, thereby changing the string on the second tape. In order to perform
this rewriting, M may apply the shifting-over technique for Turing Machines [9] by
either shifting to the right if $|\alpha| < |\beta|$, or shifting to the left if $|\alpha| > |\beta|$. Step (3):
M checks whether or not the string produced on the second tape is equal to the
word w which is kept unchanged on the first tape. If this is the case, M accepts w
and stops. If this is not the case, M simulates one more derivation step of w from
S by performing again Steps (1), (2), and (3) above.

We have that $w \in R$ iff $w \in L(M)$.
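Steps (1), (2), and (3) of the above proof can be sketched by a program which replaces the nondeterministic choices of M by a breadth-first search over the sentential forms. This is a sketch under our own assumptions: every symbol is a single character, the productions are pairs of strings with a nonempty left hand side, and the names `derives` and `max_forms` are hypothetical. The bound `max_forms` is needed because, in general, the search does not terminate when w is not generated by G.

```python
from collections import deque

def derives(productions, axiom, w, max_forms=10000):
    """Search for a derivation of w from the axiom.

    Returns True if w is derived, False if all reachable sentential forms
    have been visited without finding w, and None if max_forms distinct
    forms have been generated without a verdict.
    """
    seen = {axiom}
    queue = deque([axiom])
    while queue:
        form = queue.popleft()
        if form == w:                       # Step (3): compare with w
            return True
        for alpha, beta in productions:     # Step (1): production ...
            start = form.find(alpha)
            while start != -1:              # ... and occurrence of alpha
                # Step (2): rewrite that occurrence of alpha by beta.
                new_form = form[:start] + beta + form[start + len(alpha):]
                if new_form not in seen:
                    if len(seen) >= max_forms:
                        return None
                    seen.add(new_form)
                    queue.append(new_form)
                start = form.find(alpha, start + 1)
    return False
```

For instance, with the productions `[("S", "aSb"), ("S", "")]` the call `derives(productions, "S", "aabb")` returns True, while for the word `"aab"` the search only stops because of the bound.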


THEOREM 5.1.2. [Equivalence Between Type 0 Grammars and Turing
Machines. Part 2] For any language $R \subseteq \Sigma^*$, if there exists a Turing Machine
M such that L(M) = R, then there exists a type 0 grammar $G = \langle \Sigma, V_N, P, A_1 \rangle$,
where $\Sigma$ is the set of terminal symbols, $V_N$ is the set of nonterminal symbols, P is the
finite set of productions, and $A_1$ is the axiom, such that R is the language generated
by G, that is, R = L(G).

PROOF. Given the Turing Machine M and a word $w \in \Sigma^*$, we construct a type 0
grammar G which first makes two copies of w and then simulates the behaviour of
M on one copy. If M accepts w then $w \in L(G)$, and if M does not accept w then
$w \notin L(G)$. The detailed construction of G is as follows.

Let $M = \langle Q, \Sigma, \Gamma, q_0, B, F, \delta \rangle$. The productions of G are the following ones, where
the pairs of the form [-, -] are elements of the set $V_N$ of the nonterminal symbols:

1. $A_1 \rightarrow q_0\, A_2$

The following productions nondeterministically generate two copies of w:

2. $A_2 \rightarrow [a, a]\, A_2$     for each $a \in \Sigma$

The following productions generate all tape cells necessary for simulating the
computation of the Turing Machine M:

3.1 $A_2 \rightarrow [\varepsilon, B]\, A_2$
3.2 $A_2 \rightarrow [\varepsilon, B]$

The following productions simulate the moves to the right:

4. $q\, [a, X] \rightarrow [a, Y]\, p$
     for each $a \in \Sigma \cup \{\varepsilon\}$,
     for each $p, q \in Q$,
     for each $X \in \Gamma$, $Y \in \Gamma - \{B\}$ such that $\delta(q, X) = (p, Y, R)$

The following productions simulate the moves to the left:

5. $[b, Z]\, q\, [a, X] \rightarrow p\, [b, Z]\, [a, Y]$
     for each $a, b \in \Sigma \cup \{\varepsilon\}$,
     for each $p, q \in Q$,
     for each $X, Z \in \Gamma$, $Y \in \Gamma - \{B\}$ such that $\delta(q, X) = (p, Y, L)$

When a final state q is reached, the following productions propagate the state q to
the left and to the right, and generate the word w, making q disappear when all
the terminal symbols of w have been generated:

6.1 $[a, X]\, q \rightarrow q\, a\, q$
6.2 $q\, [a, X] \rightarrow q\, a\, q$
6.3 $q \rightarrow \varepsilon$
     for each $a \in \Sigma \cup \{\varepsilon\}$, $X \in \Gamma$, $q \in F$
We will not formally prove that all the above productions simulate the behaviour of
M, that is, for any $w \in \Sigma^*$, $w \in L(G)$ iff $w \in L(M)$.
The following observations should be sufficient:

(i) the first components of the nonterminal symbols [-, -] are never touched by the
productions so that the given word w is kept unchanged,

(ii) a nonterminal symbol [-, -] is never turned into a terminal symbol unless a final
state q is encountered first,

(iii) if the acceptance of a word w requires at most k ($\geq 0$) tape cells, we have that
the initial configuration of the Turing Machine M for the word $w = a_1 a_2 \ldots a_n$, with
$n \geq 0$, on the leftmost cells of the tape, is simulated by the derivation:

$A_1 \rightarrow^* q_0\, [a_1, a_1]\, [a_2, a_2] \ldots [a_n, a_n]\, [\varepsilon, B]\, [\varepsilon, B] \ldots [\varepsilon, B]$

where there are k ($\geq n$) nonterminal symbols to the right of $q_0$.
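The construction of the productions 1 to 6.3 is mechanical, and it can be sketched as follows. The representation is our own assumption: states and terminal symbols are strings, a nonterminal symbol [a, X] is the pair `(a, X)`, the empty word is the empty string, and a production is a pair of tuples of symbols. The function name `type0_productions` is hypothetical, and we assume that `delta` never writes the blank symbol, as required by the conditions on Y.

```python
def type0_productions(sigma, gamma, delta, q0, final_states, blank="B"):
    """List the productions of G for M; delta maps (q, X) to (p, Y, move)."""
    eps = ""                      # stands for the empty word
    firsts = list(sigma) + [eps]  # possible first components of [a, X]
    prods = [(("A1",), (q0, "A2"))]                           # 1
    prods += [(("A2",), ((a, a), "A2")) for a in sigma]        # 2
    prods.append((("A2",), ((eps, blank), "A2")))              # 3.1
    prods.append((("A2",), ((eps, blank),)))                   # 3.2
    for (q, x), (p, y, move) in delta.items():
        for a in firsts:
            if move == "R":                                    # 4
                prods.append(((q, (a, x)), ((a, y), p)))
            else:                                              # 5
                for b in firsts:
                    for z in gamma:
                        prods.append((((b, z), q, (a, x)),
                                      (p, (b, z), (a, y))))
    for q in final_states:
        for a in firsts:
            for x in gamma:
                # q a q, where a may be the empty word
                rhs = (q, a, q) if a else (q, q)
                prods.append((((a, x), q), rhs))               # 6.1
                prods.append(((q, (a, x)), rhs))               # 6.2
        prods.append(((q,), ()))                               # 6.3
    return prods
```

Note that the number of productions is finite because $\Sigma$, $\Gamma$, and Q are finite sets.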


We end this chapter by recalling that every Turing Machine can be encoded by 
a natural number. This property has been used in Chapter 4 (see Definition 4.1.2 
on page 179). In particular, we will prove that there exists an injection from the set 
of Turing Machines into the set of natural numbers. Without loss of generality, we 
will assume that: (i) the Turing Machines are deterministic, (ii) the input alphabet 
of the Turing Machines is {0,1}, (iii) the tape alphabet of the Turing Machines is 
{0,1, B}, and (iv) the Turing Machines have one final state only. 

For our proof it will be enough to show that a Turing Machine M with tape
alphabet $\{0, 1, B\}$ can be encoded by a word in $\{0, 1\}^*$. Then each word in $\{0, 1\}^*$
with a 1 in front, is the binary expansion of a natural number. The desired encoding
is constructed as follows.

Let us assume that:

- the set of states of M is $\{q_i \mid 1 \leq i \leq n\}$, for some value of $n \geq 2$,

- the tape symbols 0, 1, and B are denoted by $X_1$, $X_2$, and $X_3$, respectively, and

- L (that is, the move to the left) and R (that is, the move to the right) are denoted
by 1 and 2, respectively.

The initial and final states are assumed to be $q_1$ and $q_2$, respectively.

Then, each quintuple ‘$q_i, X_h \mapsto q_j, X_k, m$’ of M corresponds to a string of five
positive numbers $\langle i, h, j, k, m \rangle$. It should be the case that $1 \leq i, j \leq n$, $1 \leq h, k \leq 3$,
and $1 \leq m \leq 2$. Thus, the quintuple ‘$q_i, X_h \mapsto q_j, X_k, m$’ can be encoded by the
sequence: $1^i 0 1^h 0 1^j 0 1^k 0 1^m$. The various quintuples can be listed one after the other,
so as to get a sequence of the form:

000 code of the first quintuple 00 code of the second quintuple 00 ... 000.     (†)

Every sequence of the form (†) encodes one Turing Machine only.


REMARK 5.1.3. Since when describing a Turing Machine the order of the quintuples
is not significant, a Turing Machine can be encoded by several sequences of
the form (†). In order to get a unique sequence of the form (†), we take, among all
possible sequences obtained by permutations of the quintuples, the sequence which
is the binary expansion of the smallest natural number.
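The encoding of a Turing Machine by a sequence of the form (†), and the choice of the unique sequence of Remark 5.1.3, can be sketched as follows. We assume that a quintuple is given as the tuple of its five positive numbers (i, h, j, k, m); the function names are hypothetical. Since all permutations are enumerated, the sketch is practical for machines with very few quintuples only.

```python
from itertools import permutations

def encode_quintuple(i, h, j, k, m):
    """Return the word 1^i 0 1^h 0 1^j 0 1^k 0 1^m."""
    return "0".join("1" * n for n in (i, h, j, k, m))

def encode_machine(quintuples):
    """Return a sequence of the form (dagger) for the given list order."""
    codes = [encode_quintuple(*q) for q in quintuples]
    return "000" + "00".join(codes) + "000"

def canonical_number(quintuples):
    """Smallest natural number whose binary expansion is 1 followed by
    a sequence of the form (dagger) for some order of the quintuples."""
    return min(int("1" + encode_machine(order), 2)
               for order in permutations(quintuples))
```

For instance, the quintuple ‘$q_1, X_2 \mapsto q_3, X_1, R$’ corresponds to (1, 2, 3, 1, 2) and it is encoded by the word 1011011101011.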


There is an injection from the set N of natural numbers into the set of Turing 
Machines because for each n € N we can construct the Turing Machine which 
computes n. Thus, by the Bernstein Theorem (see Theorem 7.9.2 on page 235) we 
have that there is a bijection between the set of Turing Machines and the set of 
natural numbers. 


CHAPTER 6 


Decidability and Undecidability in Context-Free Languages 


Let us begin by recalling a few elementary concepts of Computability Theory which 
are necessary for understanding the decidability and undecidability results we will 
present in this chapter. More results can be found in [9] and the interested reader 
may refer to that book. 

DEFINITION 6.0.4. [Recursively Enumerable Language] Given an alphabet
$\Sigma$, we say that a language $L \subseteq \Sigma^*$ is recursively enumerable, or r.e., or L is a
recursively enumerable subset of $\Sigma^*$, iff there exists a Turing Machine M such that
for all words $w \in \Sigma^*$, M accepts the word w iff $w \in L$.


If a language $L \subseteq \Sigma^*$ is r.e. and M is a Turing Machine that accepts L, we have
that for all words $w \in \Sigma^*$, if $w \notin L$ then either (i) M rejects w or (ii) M ‘runs
forever’ without accepting w, that is, for all configurations $\gamma$ such that $q_0 w \rightarrow^*_M \gamma$,
where $q_0 w$ is the initial configuration of M, there exists a configuration $\gamma'$ such that:
(ii.1) $\gamma \rightarrow_M \gamma'$ and (ii.2) the states in $\gamma$ and $\gamma'$ are not final.


Recall that the language accepted by a Turing Machine M is denoted by L(M). 

Given the alphabet $\Sigma$, we denote by R.E. the class of the recursively enumerable
languages which are subsets of $\Sigma^*$.

DEFINITION 6.0.5. [Recursive Language] We say that a language $L \subseteq \Sigma^*$ is
recursive, or L is a recursive subset of $\Sigma^*$, iff there exists a Turing Machine M such
that for all words $w \in \Sigma^*$, (i) M accepts the word w iff $w \in L$, and (ii) M rejects
the word w iff $w \notin L$ (see also Definition 4.0.4 on page 173).


Given the alphabet $\Sigma$, we denote by REC the class of the recursive languages
which are subsets of $\Sigma^*$. One can show that the class of recursive languages is
properly contained in the class of the r.e. languages.


Now we introduce the notion of a decidable problem. Together with that notion 
we also introduce the related notions of a semidecidable problem and an undecidable 
problem. We first introduce the following three notions. 


DEFINITION 6.0.6. [Problem, Instance of a Problem, Solution of a Problem]
Given an alphabet $\Sigma$, (i) a problem is a language $L \subseteq \Sigma^*$, (ii) an instance of a
problem $L \subseteq \Sigma^*$ is a word $w \in \Sigma^*$, and (iii) a solution of a problem $L \subseteq \Sigma^*$ is an
algorithm, that is, a Turing Machine, which accepts the language L (see Definition 5.0.6
on page 186).


Given a problem L, we will also say that L is the language associated with that
problem.

As we will see below (see Definitions 6.0.7 and 6.0.8), a problem L is said to be
decidable or semidecidable depending on the properties of the Turing Machine, if
any, which provides a solution of L.


Note that an instance $w \in \Sigma^*$ of a problem $L \subseteq \Sigma^*$ can be viewed as a membership
question of the form: «Does the word w belong to the language L?». For
this reason in some textbooks a problem, as we have defined it in Definition 6.0.6
above, is said to be a yes-no problem, and the language L associated with a yes-no
problem is also called the yes-language of the problem. Indeed, given a problem L,
its yes-language, which is L itself, consists of all words w such that the answer to
the question: «Does w belong to L?» is ‘yes’. The words of the yes-language L are
called yes-instances of the problem.


We introduce the following definitions. 


DEFINITION 6.0.7. [Decidable and Undecidable Problem] Given an alphabet
$\Sigma$, a problem $L \subseteq \Sigma^*$ is said to be decidable (or solvable) iff L is recursive. A
problem is said to be undecidable (or unsolvable) iff it is not decidable.


As a consequence of this definition, every problem L such that the language L 
is finite, is decidable. 


DEFINITION 6.0.8. [Semidecidable Problem] A problem L is said to be
semidecidable (or semisolvable) iff L is recursively enumerable.


We have that the class of decidable problems is properly contained in the class of the
semidecidable problems, because for any fixed alphabet $\Sigma$, every recursive subset of
$\Sigma^*$ is a particular recursively enumerable subset of $\Sigma^*$, and there exists a recursively
enumerable subset of $\Sigma^*$ which is not a recursive subset of $\Sigma^*$.


Now, in order to fix the reader’s ideas, we present two problems: (i) the Primality 
Problem, and (ii) the Parsing Problem. 


EXAMPLE 6.0.9. [Primality Problem] The Primality Problem is the subset of
$\{1\}^*$ defined as follows:

Prime = $\{1^n \mid n$ is a prime number$\}$.

An instance of the Primality Problem is a word of the form $1^n$, for some $n \geq 0$. A
Turing Machine M is a solution of the Primality Problem iff for all words w of the
form $1^n$ with $n \geq 1$, we have that M accepts w iff $w \in$ Prime. Obviously, the
yes-language of the Primality Problem is Prime. We have that the Primality Problem
is decidable.

Note that we may choose other ways of encoding the prime numbers, thereby
getting other equivalent ways of presenting the Primality Problem.
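The decidability of the Primality Problem can be illustrated by a procedure which always terminates with the answer ‘yes’ or ‘no’. The following sketch uses trial division on the length of the word; it is a Python function and not a Turing Machine, and the name `accepts_prime` is our own.

```python
def accepts_prime(word):
    """Answer 'yes' (True) iff word is 1^n for some prime number n."""
    if set(word) - {"1"}:
        return False          # not an instance: some symbol is not 1
    n = len(word)
    if n < 2:
        return False          # 0 and 1 are not prime numbers
    d = 2
    while d * d <= n:         # trial division by d = 2, 3, ...
        if n % d == 0:
            return False
        d += 1
    return True
```

The loop performs fewer than n iterations, and thus the procedure terminates for every input word.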


EXAMPLE 6.0.10. [Parsing Problem] The Parsing Problem is the subset Parse
of $\{0, 1\}^*$ defined as follows:

Parse = $\{[G]\, 000\, [w] \mid w \in L(G)\}$

where [G] is the encoding of a grammar G as a string in $\{0, 1\}^*$ and [w] is the
encoding of a word w as a string in $\{0, 1\}^*$, as we now specify.

Let us consider a grammar $G = \langle V_T, V_N, P, S \rangle$. Let us encode every symbol of
the set $V_T \cup V_N \cup \{\rightarrow\}$ as a string of the form $0 1^n$ for some value of n, with $n \geq 1$,
so that two distinct symbols have two different values of n. Thus, a production of
the form: $x_1 \ldots x_m \rightarrow y_1 \ldots y_n$, for some $m \geq 1$ and $n \geq 0$, with the $x_i$’s and the
$y_i$’s in $V_T \cup V_N$, will be encoded by a string of the form: $0 1^{k_1} 0 1^{k_2} \ldots 0 1^{k_p} 0$, where
$k_1, k_2, \ldots, k_p$ are positive integers and p = m+n+1. The set of productions of the
grammar G can be encoded by a string of the form: $0\, \sigma_1 \ldots \sigma_n\, 0$, where each $\sigma_i$ is the
encoding of a production of G, and two consecutive 0’s denote the beginning and
the end (of the encoding) of a production. Then [G] can be taken to be the string
$0 1^{k_S}\, 0\, \sigma_1 \ldots \sigma_n\, 0$, where $0 1^{k_S}$ encodes the axiom of G. We also stipulate that a string
in $\{0, 1\}^*$ which does not comply with the above encoding rules, is the encoding of
a grammar which generates the empty language.

The encoding [w] of a word $w \in V_T^*$ as a string in $\{0, 1\}^*$, is a word of the form
$0 1^{k_1} 0 1^{k_2} \ldots 0 1^{k_q} 0$, where $k_1, k_2, \ldots, k_q$ are positive integers.

An instance of the Parsing Problem is a word of the form [G] 000 [w], where:
(i) [G] is the encoding of a grammar G, and (ii) [w] is the encoding of a word
$w \in V_T^*$.

A Turing Machine M is a solution of the Parsing Problem if given a word of 
the form [G]000[w] for some grammar G and word w, we have that M accepts 
[G] 000 [w] iff w € L(G), that is, M accepts [G] 000 [w] iff [G] 000 [w] € Parse. 

Obviously, the yes-language of the Parsing Problem is Parse. 

We have the following decidability results if we restrict the class of the grammars 
we consider in the Parsing Problem. In particular, 

(i) if the grammars of the Parsing Problem L are type 1 grammars then the Parsing 
Problem is decidable, and 

(ii) if the grammars which are considered in the Parsing Problem L are type 0 
grammars then the Parsing Problem is semidecidable and it is undecidable. 


DEFINITION 6.0.11. [Property Associated with a Problem] With every
problem $L \subseteq \Sigma^*$ for some alphabet $\Sigma$, we associate a property $P_L$ such that $P_L(x)$
holds iff $x \in L$.


For instance, in the case of the Parsing Problem, $P_{Parsing}(x)$ holds iff x is a word in $\{0, 1\}^*$
of the form [G] 000 [w], for some grammar G and some word w such that $w \in L(G)$.

Instead of saying that a problem L is decidable (or undecidable, or semidecidable,
respectively), we will also say that the associated property $P_L$ is decidable (or
undecidable, or semidecidable, respectively).


REMARK 6.0.12. [Specifications of a Problem] As it is often done in the
literature, we will also specify a problem $\{x \mid P_L(x)\}$ by using the sentence:

« Given x, determine whether or not $P_L(x)$ holds »

or by asking the question: « $P_L(x)$? »

Thus, for instance, (i) instead of saying ‘the problem $\{x \mid P_L(x)\}$’, we will also say
‘the problem of determining, given x, whether or not $P_L(x)$ holds’, and (ii) instead of
saying ‘the problem of determining, given a grammar G, whether or not $L(G) = \Sigma^*$
holds’, we will also ask the question ‘$L(G) = \Sigma^*$?’ (see the entries of Table 1 on
page 201 and Table 2 on page 222).


We have the following results which we state without proof. 


FACT 6.0.13. (i) The complement $\Sigma^* - L$ of a recursive set L is recursive.
(ii) The union of two recursive languages is recursive. The union of two r.e. languages
is r.e.


For the proof of Part (i) of the above fact, it is enough to make a simple modification
to the Turing Machine M which accepts L. Indeed, given any $w \in \Sigma^*$, if
M accepts (or rejects) w then the Turing Machine $M_1$ which accepts $\Sigma^* - L$, rejects
(or accepts, respectively) w.


THEOREM 6.0.14. [Post Theorem] If a language L and its complement $\Sigma^* - L$
are r.e. languages, then L is recursive.


Thus, given any set $L \subseteq \Sigma^*$, there are four mutually exclusive possibilities:

(i) L is recursive and $\Sigma^* - L$ is recursive
(ii) L is not r.e. and $\Sigma^* - L$ is not r.e.
(iii.1) L is r.e. and not recursive and $\Sigma^* - L$ is not r.e.
(iii.2) L is not r.e. and $\Sigma^* - L$ is r.e. and not recursive


As a consequence, in order to show that a problem is unsolvable and its associated
language L is not recursive, it is enough to show that $\Sigma^* - L$ is not r.e.
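The idea behind the Post Theorem can be sketched as follows: given a procedure which semidecides L and a procedure which semidecides $\Sigma^* - L$, we run them in an interleaved way, one step each, until one of them accepts. In this sketch, which is our own illustration and not a proof, a semidecision procedure is modelled by a Python generator which yields False after each step without a verdict and yields True when it accepts. The toy language of the words whose length is a perfect square is a hypothetical example.

```python
def decide(w, accept_L, accept_co_L):
    """Decide whether w is in L by interleaving two semidecision
    procedures, one for L and one for its complement."""
    run_L, run_co_L = accept_L(w), accept_co_L(w)
    while True:
        if next(run_L):       # one more step of the procedure for L
            return True
        if next(run_co_L):    # one more step for the complement of L
            return False

def accept_square(w):
    """Accepts iff len(w) is a perfect square; otherwise runs forever."""
    n = 0
    while True:
        yield n * n == len(w)
        n += 1

def accept_non_square(w):
    """Accepts iff len(w) is not a perfect square; otherwise runs forever."""
    n = 0
    while True:
        yield n * n < len(w) < (n + 1) * (n + 1)
        n += 1
```

Since every word belongs either to L or to $\Sigma^* - L$, one of the two procedures eventually accepts, and thus `decide` always terminates.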


An alternative technique for showing that a problem is unsolvable and its associated
language is not recursive, is the so called reduction technique which can be
described as follows. We say that a problem A whose associated yes-language is $L_A$,
subset of $\Sigma^*$, is reduced to a problem B whose associated yes-language is $L_B$, also
subset of $\Sigma^*$, iff there exists a total, computable function, say r, from $\Sigma^*$ to $\Sigma^*$ such
that for every word w in $\Sigma^*$, w is in $L_A$ iff r(w) is in $L_B$. Thus, if the problem B
is decidable then the problem A is decidable, and if the problem A is undecidable
then the problem B is undecidable.
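The first consequence of a reduction, that is, ‘if B is decidable then A is decidable’, can be sketched as the composition of the function r with a decision procedure for B. The two toy problems below are our own hypothetical choice: A is the set of words over {a, b} with an even number of a’s, B is the set of words of even length, and r erases every b.

```python
def decide_via_reduction(r, decide_B):
    """Build a decision procedure for A from r and one for B."""
    return lambda w: decide_B(r(w))

def erase_b(w):
    """The reduction r: a total, computable function on words."""
    return w.replace("b", "")

def even_length(w):
    """A decision procedure for the problem B."""
    return len(w) % 2 == 0

# w has an even number of a's iff erase_b(w) has even length.
even_number_of_as = decide_via_reduction(erase_b, even_length)
```

The second consequence is the contrapositive of the first one: a decision procedure for B would give, by composition with r, a decision procedure for A.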

Now let us consider a problem, called the Halting Problem. It is defined to be 
the set of the encodings of all pairs of the form (Turing Machine M, word w) such 
that M halts on the input w. Thus, the Halting Problem can also be formulated 
as follows: given a Turing Machine M and a word w, determine whether or not M 
halts on the input w. 

We have the following result which we state without proof [9, 14]. 


THEOREM 6.0.15. [Turing Theorem] The Halting Problem is semidecidable 
and it is not decidable. 


By reduction of the Halting Problem, one can show that also the following two 
problems are undecidable: 


(i) Blank Tape Halting Problem: given a Turing Machine M, determine whether or
not M halts in a final state when its initial tape has blank symbols only, and

(ii) Uniform Halting Problem (or Totality Problem): given a Turing Machine M
with input alphabet $\Sigma$, determine whether or not M halts in a final state for every
input word in $\Sigma^*$.


6.1. Some Basic Decidability and Undecidability Results 


In this section we present some more decidability and undecidability results about 
context-free languages (see also [9, Section 8.5]) besides those which have been 
presented in Section 3.14. 

By using the fact that the so called Post Correspondence Problem is undecidable 
(see [9, Section 8.5]), we will show that it is undecidable whether or not a given 
context-free grammar G is ambiguous, that is, it is undecidable whether or not there 
exists a word w generated by G such that w has two distinct leftmost derivations 
(see Theorem 6.1.3 on page 199). 


An instance of the Post Correspondence Problem, PCP for short, over the alphabet
$\Sigma$, is given by (the encoding of) two sequences of k words each, say $\langle u_1, \ldots, u_k \rangle$
and $\langle v_1, \ldots, v_k \rangle$, where the $u_i$’s and the $v_i$’s are elements of $\Sigma^*$. A given instance of
the PCP is a yes-instance, that is, it belongs to the yes-language of the PCP, iff there
exists a sequence $\langle i_1, \ldots, i_n \rangle$ of indexes, with $n \geq 1$, taken from the set $\{1, \ldots, k\}$,
such that the following equality between two words of $\Sigma^*$ holds:

$u_{i_1} \ldots u_{i_n} = v_{i_1} \ldots v_{i_n}$

This sequence $\langle i_1, \ldots, i_n \rangle$ of indexes is called a solution of the given instance of the
PCP.


THEOREM 6.1.1. [Unsolvability of the Post Correspondence Problem]
The Post Correspondence Problem over the alphabet $\Sigma$, with $|\Sigma| \geq 2$, is unsolvable
if its instances are given by two sequences of k words, with $k \geq 2$.


PROOF. One can show that the Halting Problem can be reduced to it. 


THEOREM 6.1.2. [Semisolvability of the Post Correspondence Problem] 
The Post Correspondence Problem is semisolvable. 


PROOF. One can find the sequence of indexes which solves the problem, if there 
exists one, by checking the equality of the two words corresponding to the two 
sequences of indexes taken one at a time in the canonical order over the set {1,...,k} 
(where we assume that 1< ... <k), that is, 1, 2, ..., k, 11, 12, ..., 1k, 21, 22,..., 
2k, ... kk, 111, ..., kkk,... 
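The enumeration of the sequences of indexes in the canonical order can be sketched as follows. The function name `pcp_solution` and the optional bound `max_length` are our own assumptions. Without the bound, the procedure terminates iff the given instance has a solution, and this is exactly what semisolvability requires.

```python
from itertools import count, product

def pcp_solution(us, vs, max_length=None):
    """Return the first solution in the canonical order, if any.

    With max_length=None the search runs forever on instances which
    have no solution; with a bound it returns None when no solution
    of length at most max_length exists.
    """
    k = len(us)
    lengths = count(1) if max_length is None else range(1, max_length + 1)
    for n in lengths:
        # product enumerates 1...1, 1...2, ..., k...k in canonical order
        for indexes in product(range(1, k + 1), repeat=n):
            u = "".join(us[i - 1] for i in indexes)
            v = "".join(vs[i - 1] for i in indexes)
            if u == v:
                return indexes
    return None
```

For instance, for the sequences $\langle a, ab, bba \rangle$ and $\langle baa, aa, bb \rangle$ the first solution found is $\langle 3, 2, 3, 1 \rangle$, because both bba ab bba a and bb aa bb baa are equal to bbaabbbaa.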


There is a variant of the Post Correspondence Problem, called the Modified Post
Correspondence Problem, where it is assumed that in the solution sequence $i_1$ is 1. Also
the Modified Post Correspondence Problem is unsolvable.


THEOREM 6.1.3. [Undecidability of Ambiguity for Context-Free Grammars]
The ambiguity problem for context-free grammars is undecidable.

PROOF. It is enough to reduce the PCP to the ambiguity problem of context-free
grammars. Consider a finite alphabet $\Sigma$ and two sequences of k ($\geq 1$) words, each
word being an element of $\Sigma^*$:

U = $\langle u_1, \ldots, u_k \rangle$, and
V = $\langle v_1, \ldots, v_k \rangle$.

Let us also consider the set A of k new symbols $\{a_1, \ldots, a_k\}$ such that $\Sigma \cap A = \emptyset$,
and the following two languages which are subsets of $(\Sigma \cup A)^*$:

$U_L = \{u_{i_1} u_{i_2} \ldots u_{i_r} a_{i_r} \ldots a_{i_2} a_{i_1} \mid r \geq 1$ and $1 \leq i_1, i_2, \ldots, i_r \leq k\}$, and
$V_L = \{v_{i_1} v_{i_2} \ldots v_{i_r} a_{i_r} \ldots a_{i_2} a_{i_1} \mid r \geq 1$ and $1 \leq i_1, i_2, \ldots, i_r \leq k\}$.

A grammar G for generating the language $U_L \cup V_L$ is as follows:

$\langle \Sigma \cup A, \{S, S_U, S_V\}, P, S \rangle$, where P is the following set of productions:

$S \rightarrow S_U$
$S_U \rightarrow u_i\, S_U\, a_i \mid u_i\, a_i$     for any i = 1, ..., k,
$S \rightarrow S_V$
$S_V \rightarrow v_i\, S_V\, a_i \mid v_i\, a_i$     for any i = 1, ..., k.
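The construction of the grammar G from the sequences U and V can be sketched as follows. The representation is our own assumption: a production is a pair made out of a nonterminal and a list of symbols, the new symbol $a_i$ is the string `"a1"`, `"a2"`, and so on, and the function names are hypothetical. The function `word_from` builds the word of $U_L$ (or $V_L$) which corresponds to a given sequence of indexes.

```python
def ambiguity_grammar(us, vs):
    """Productions of G for the PCP instance given by us and vs."""
    prods = [("S", ["S_U"]), ("S", ["S_V"])]
    for i, (u, v) in enumerate(zip(us, vs), start=1):
        a = f"a{i}"  # the new symbol a_i, not occurring in the words
        prods += [("S_U", [u, "S_U", a]), ("S_U", [u, a]),
                  ("S_V", [v, "S_V", a]), ("S_V", [v, a])]
    return prods

def word_from(words, indexes):
    """The word w_{i_1} ... w_{i_n} a_{i_n} ... a_{i_1} as a list of symbols."""
    prefix = "".join(words[i - 1] for i in indexes)
    return list(prefix) + [f"a{i}" for i in reversed(indexes)]
```

For the sequences $\langle a, ab, bba \rangle$ and $\langle baa, aa, bb \rangle$ and the indexes $\langle 3, 2, 3, 1 \rangle$, the words `word_from(us, (3, 2, 3, 1))` and `word_from(vs, (3, 2, 3, 1))` are equal, so that the same word is derived both from $S_U$ and from $S_V$.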


Now in order to prove the theorem we need to show that the instance of the PCP 
for the sequences U and V has a solution iff the grammar G is ambiguous. 
(only-if part) If u_{i_1} ... u_{i_n} = v_{i_1} ... v_{i_n} for some n ≥ 1, then we have that the word w
which is

u_{i_1} u_{i_2} ... u_{i_n} a_{i_n} ... a_{i_2} a_{i_1}
is equal to the word

v_{i_1} v_{i_2} ... v_{i_n} a_{i_n} ... a_{i_2} a_{i_1}


and w has two leftmost derivations:

(i) a first derivation which first uses the production S → S_U, and
(ii) a second derivation, which first uses the production S → S_V.
Thus, G is ambiguous.


(if part) Assume that G is ambiguous. Then there are two leftmost derivations for a
word generated by G. Since every word generated by S_U has one leftmost derivation
only, and every word generated by S_V has one leftmost derivation only (and this is
due to the fact that the symbols a_i force the uniqueness of the productions used
when deriving a word from S_U or S_V), it must be the case that a word generated
from S_U is the same as a word generated from S_V. This means that we have:


u_{i_1} u_{i_2} ... u_{i_n} a_{i_n} ... a_{i_2} a_{i_1} = v_{i_1} v_{i_2} ... v_{i_n} a_{i_n} ... a_{i_2} a_{i_1}
for some sequence ⟨i_1, i_2, ..., i_n⟩ of indexes with n ≥ 1, where each index is taken
from the set {1, ..., k}.
Thus, u_{i_1} u_{i_2} ... u_{i_n} = v_{i_1} v_{i_2} ... v_{i_n}
and this means that the corresponding PCP has the solution ⟨i_1, i_2, ..., i_n⟩. □
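
The reduction can be made concrete with a small Python sketch. The encoding is our own assumption: symbols of the alphabet are single characters, the new symbols a_1, ..., a_k are the strings "a_1", ..., "a_k", and the names pcp_grammar and word_via are hypothetical. The sketch builds the productions for the two sub-grammars and derives the word associated with a given index sequence from each of them.

```python
def pcp_grammar(us, vs):
    """Productions of G for S_U and S_V, keyed by
    (nonterminal, index i, recursive?). The two productions
    S -> S_U and S -> S_V are handled in word_via."""
    prods = {}
    for name, words in (("S_U", us), ("S_V", vs)):
        for i, w in enumerate(words, start=1):
            a = f"a_{i}"                                  # new symbol a_i
            prods[(name, i, True)] = list(w) + [name, a]  # X -> w_i X a_i
            prods[(name, i, False)] = list(w) + [a]       # X -> w_i a_i
    return prods

def word_via(prods, name, seq):
    """Leftmost derivation S => name => ... driven by the indexes in seq."""
    form = [name]                       # after S -> S_U or S -> S_V
    for pos, i in enumerate(seq):
        last = pos == len(seq) - 1
        at = form.index(name)           # the only nonterminal in the form
        form[at:at + 1] = prods[(name, i, not last)]
    return form

us = ["a", "ab", "bba"]
vs = ["baa", "aa", "bb"]
prods = pcp_grammar(us, vs)
seq = [3, 2, 3, 1]                      # a solution of this PCP instance
w_u = word_via(prods, "S_U", seq)
w_v = word_via(prods, "S_V", seq)
```

For a solution of the PCP instance the two derivations produce the same word, which therefore has two leftmost derivations in G. For an index sequence which is not a solution the two words differ.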
(5) Is Σ* − L(G) a context-free language?
(6) Is L(G1) ∩ L(G2) a context-free language?


TABLE 1. Undecidable problems for S-extended context-free grammars.
The grammars G, G1, and G2 are S-extended context-free
grammars with terminal alphabet Σ. R is a regular language, possibly
including the empty string ε. Explanations about these problems are
given in Section 6.1.1.


6.1.1. Basic Undecidable Properties of Context-Free Languages


We start this section by listing in Table 1 on page 201 some undecidability results
about S-extended context-free grammars with terminal alphabet Σ.

For understanding these results and the other decidability and undecidability
results we will list in the sequel, it is important that the reader correctly identifies
the infinite set of instances of the problems to which those results refer. For example,
an instance of Problem (1.a) is given by (the encoding of) a context-free grammar G
and (the encoding of) a regular grammar which generates the language R, and an
instance of Problem (5) is given by (the encoding of) a context-free grammar G.

Let us now make a few comments on the undecidable problems listed in Table 1. 


• Problem (1.a) is undecidable in the sense that there does not exist a Turing Machine
which, given an S-extended context-free grammar G and a regular language R,
which may also include the empty string ε, always terminates and answers ‘yes’
iff L(G) = R. This problem is not even semidecidable because the negated
problem, that is, «L(G) ≠ R?», is semidecidable and not decidable (recall Post
Theorem 6.0.14 on page 198). Problem (1.b) is undecidable in the same sense as
Problem (1.a), but instead of the formula L(G) = R one should consider the formula
R ⊆ L(G).

• Problem (2.a) is undecidable in the sense that there does not exist a Turing Machine
which, given two S-extended context-free grammars G1 and G2, always terminates and
answers ‘yes’ iff L(G1) = L(G2). Problem (2.b) is undecidable in the same sense
as Problem (2.a), but instead of the formula L(G1) = L(G2), one should consider
the formula L(G1) ⊆ L(G2). Problems (2.a) and (2.b) are not even semidecidable,
because the negated problems are semidecidable and not decidable.


• Problem (3.a) is undecidable in the sense that there does not exist a Turing Machine
which, given an S-extended context-free grammar G with terminal alphabet Σ,
always terminates and answers ‘yes’ iff L(G) = Σ*. Actually, Problem (3.a) is
not even semidecidable because its complement is semidecidable and not decidable.

Problem (3.b) is undecidable in the same sense as Problem (3.a), but instead of the
formula L(G) = Σ*, one should consider the formula L(G) = Σ+.

Problem (5) is undecidable in the same sense as Problem (3.a), but instead of the
formula L(G) = Σ*, one should consider the formula Σ* − L(G) = A, for some
context-free language A.


• Problem (4) is undecidable in the sense that there does not exist a Turing Machine
which, given two S-extended context-free grammars G1 and G2, always terminates and
answers ‘yes’ iff L(G1) ∩ L(G2) = ∅. Actually, Problem (4) is not even semidecidable
because its complement is semidecidable and not decidable.

Problem (6) is undecidable in the same sense as Problem (4), but instead of the
formula L(G1) ∩ L(G2) = ∅, one should consider the formula L(G1) ∩ L(G2) = L,
for some context-free language L.


With reference to Problem (1.b) of Table 1 above, note that given a context-free
grammar G and a regular language R, it is decidable whether or not L(G) ⊆ R. This
follows from the following facts: (i) L(G) ⊆ R iff L(G) ∩ (Σ* − R) = ∅, (ii) Σ* − R
is a regular language, (iii) L(G) ∩ (Σ* − R) is a context-free language because the
intersection of a context-free language and a regular language is a context-free
language (see Theorem 3.13.4 on page 158), and (iv) it is decidable whether or not
L(G) ∩ (Σ* − R) = ∅, because the emptiness problem for the language generated by
a context-free grammar is decidable (see Theorem 3.14.1 on page 159).

The construction of the context-free grammar, say G1, which generates the
language L(G) ∩ (Σ* − R) can be done in two steps: (iii.1) we first construct the pda M
accepting L(G) ∩ (Σ* − R) as indicated in [9, pages 135-136] and in the proof of
Theorem 3.13.4 on page 158, and then (iii.2) we construct G1 as the context-free
grammar which is equivalent to M (see the proof of Theorem 3.1.14 on page 104).
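
The decision procedure for L(G) ⊆ R can also be sketched directly in Python. The sketch below does not follow the two-step construction through the pda M. As a different route to the same test, it assumes that R is given by a complete deterministic finite automaton and computes, for every nonterminal X, the pairs of states (p, q) such that X derives a word leading the automaton from p to q. The function names and the encoding of grammars and automata are our own assumptions.

```python
def cfg_meets_dfa(prods, start, delta, q0, finals):
    """True iff some word of L(G) is accepted by the DFA.

    prods: dict nonterminal -> list of right-hand sides (lists of symbols);
           a symbol is a nonterminal iff it is a key of prods.
    delta: dict (state, terminal) -> state, assumed total.
    """
    states = {q0} | {q for (q, _) in delta} | set(delta.values())
    # reach[X]: pairs (p, q) such that X derives a word w with delta*(p, w) = q
    reach = {X: set() for X in prods}
    changed = True
    while changed:                      # least fixpoint, sets only grow
        changed = False
        for X, rhss in prods.items():
            for rhs in rhss:
                pairs = {(p, p) for p in states}      # empty prefix
                for sym in rhs:
                    if sym in prods:
                        step = reach[sym]
                    else:
                        step = {(p, delta[(p, sym)]) for p in states}
                    pairs = {(p, r) for (p, q) in pairs
                             for (q2, r) in step if q == q2}
                if not pairs <= reach[X]:
                    reach[X] |= pairs
                    changed = True
    return any((q0, f) in reach[start] for f in finals)

def cfg_subset_of_regular(prods, start, delta, q0, finals):
    """L(G) is a subset of R iff L(G) does not meet the complement of R."""
    states = {q0} | {q for (q, _) in delta} | set(delta.values())
    return not cfg_meets_dfa(prods, start, delta, q0, states - set(finals))

# G generates {a^n b^n | n >= 1}.
prods = {"S": [["a", "S", "b"], ["a", "b"]]}
# Complete DFA for a*b* (state 2 is a sink), final states {0, 1}.
a_star_b_star = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 2,
                 (1, "b"): 1, (2, "a"): 2, (2, "b"): 2}
# Complete DFA for (ab)* (state 2 is a sink), final states {0}.
ab_star = {(0, "a"): 1, (0, "b"): 2, (1, "a"): 2,
           (1, "b"): 0, (2, "a"): 2, (2, "b"): 2}
```

Complementing R amounts to exchanging final and non-final states, which is why the automaton is required to be complete and deterministic.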


Here are some more undecidability results relative to context-free languages. (We 
start the numbering of these results from (7) because the results (1.a)—(6) are those 
listed in Table 1 on page 201.) 


(7) It is undecidable whether or not a context-sensitive grammar generates a
context-free language [2, page 208].


(8) It is undecidable whether or not a context-free grammar generates a regular 
language. This result is a corollary of Theorem 6.1.6 below. 


(9) It is undecidable whether or not a context-free grammar generates a prefix-free
language. Indeed, this problem can be reduced to the problem of checking whether
or not two context-free languages have empty intersection [8, page 262]. Note that
if we know that the given language is a deterministic context-free language then the
problem is decidable [8, page 355].


(10) It is undecidable whether or not the language L(G) generated by a context-free 
grammar G, can be generated by a linear context-free grammar (see Definition 3.1.22 
on page 110 and Definition 7.6.7 on page 228). 
(11) It is undecidable whether or not a context-free grammar generates a
deterministic context-free language (see Fact 3.16.1 on page 169).


(12) It is undecidable whether or not a context-free grammar is ambiguous. 
(13) It is undecidable whether or not a context-free language is inherently ambiguous. 


Now we present a theorem which allows us to show that it is undecidable whether 
or not a context-free grammar defines a regular language. We need first the following 
two definitions. 


DEFINITION 6.1.4. [Languages Effectively Closed Under Concatenation
With Regular Sets and Union] We say that a class C of languages is effectively
closed under concatenation with regular sets and union iff there exists a Turing
Machine which for all pairs of languages L1 and L2 in C and all regular languages
R, from the encodings (for instance, as strings in {0,1}*) of the grammars which
generate L1, L2, and R, constructs the encodings of the grammars which generate
the following three languages:


(i) R · L1, (ii) L1 · R, (iii) L1 ∪ L2,
and these languages are in C.


DEFINITION 6.1.5. [Quotient of a Language] Given an alphabet Σ, a language
L ⊆ Σ*, and a symbol b ∈ Σ, we say that the set {w | wb ∈ L} is the quotient
language of L with respect to b.


THEOREM 6.1.6. [Greibach Theorem on Undecidability] Let us consider a
class C of languages which is effectively closed under concatenation with regular sets
and union. Let us assume that for that class C the problem of determining, given
a language L, whether or not L = Σ* for any sufficiently large cardinality of Σ, is
undecidable. Let P be a nontrivial property of C, that is, P is a non-empty subset
of C and P is different from C.

If P holds for all regular sets and it is preserved under quotient with respect to
any symbol in Σ, then P is undecidable for C.


By Theorem 6.1.6, it is undecidable whether or not a context-free grammar
defines a regular language (see the undecidability result (8) on page 202 and also
Property (D5) on page 204 below). Indeed, we have that:

(1) the class of context-free languages is effectively closed under concatenation with
regular sets and union, and for context-free languages the problem of determining
whether or not L = Σ* for |Σ| ≥ 2 is undecidable,

(2) the class of regular languages is a nontrivial subset of the context-free languages,
(3) the property of being a regular language obviously holds for all regular languages,
and

(4) the class of regular languages is closed under quotient with respect to any symbol
in Σ (see Definition 6.1.5 above). Indeed, it is enough to delete the final symbol
in the corresponding regular expression. (Note that in order to do so it may be
necessary to apply first the distributivity laws of Section 2.7.)
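
Closure of the regular languages under quotient can also be seen on finite automata, and this is easy to sketch in Python. Note that this is not the regular-expression argument of Point (4): here we assume that L is given by a complete deterministic finite automaton, and we keep its states and transitions while declaring a state q final iff the move on b from q leads to a final state of the original automaton. The names quotient_finals and accepts are hypothetical.

```python
def quotient_finals(delta, finals, b):
    """Final states of a DFA for {w | wb in L}, given a DFA for L.
    States, transitions and initial state are left unchanged."""
    states = {q for (q, _) in delta}
    return {q for q in states if delta.get((q, b)) in finals}

def accepts(delta, q0, finals, word):
    """Run a complete DFA on word."""
    q = q0
    for c in word:
        q = delta[(q, c)]
    return q in finals

# Complete DFA for L = a*b over {a, b}; state 2 is a sink.
delta = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 2,
         (1, "b"): 2, (2, "a"): 2, (2, "b"): 2}
finals = {1}
finals_b = quotient_finals(delta, finals, "b")   # quotient w.r.t. b is a*
finals_a = quotient_finals(delta, finals, "a")   # quotient w.r.t. a is empty
```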
Theorem 6.1.6 allows us to show that also inherent ambiguity for context-free
languages is undecidable. We recall that a context-free language L is said to be 
inherently ambiguous iff every context-free grammar G generating L is ambiguous, 
that is, there is a word of L which has two distinct leftmost derivations according 
to G (see Section 3.12). 

We also have the following result. 


FACT 6.1.7. [Undecidability of the Regularity Problem for Context-Free
Languages] (i) There does not exist an algorithm which always terminates and, given
a context-free grammar G, tells us whether or not there exists a regular grammar
equivalent to G. (ii) There does not exist an algorithm which, given a context-free
grammar G, if the language generated by G is a regular language, then terminates
and constructs a regular grammar which generates that language.


Point (i) of the above Fact 6.1.7 is the undecidability result (8) of page 202 and 
should be contrasted with Property (D5) of deterministic context-free languages 
(see page 204). Point (ii) follows from the fact that the problem «L(G) = R?» is 
undecidable and not semidecidable (see Problem (1.a) on page 201). 

In the following two sections we list some decidability and undecidability results 
for the class of deterministic context-free languages. We divide these results into 
two lists: 

(i) the list of the decidable properties of deterministic context-free languages which 
are undecidable for context-free languages (see Section 6.2), and 

(ii) the list of the undecidable properties of deterministic context-free languages 
which are undecidable also for context-free languages (see Section 6.3). 


6.2. Decidability in Deterministic Context-Free Languages 


The following properties are decidable for deterministic context-free languages.
These properties are undecidable for context-free languages in the sense that we
will indicate in Fact 6.2.1 below [9, page 246].

We assume that the terminal alphabet of the grammars and languages under
consideration is Σ with |Σ| ≥ 2.

Given a deterministic context-free language L and a regular language R, it is 
decidable to test whether or not: 


(D1) L = R,
(D2) R ⊆ L,
(D3) L = Σ*, that is, the complement of L is empty,
(D4) Σ* − L is a context-free language (recall that the complement of a deterministic
context-free language is a deterministic context-free language),


(D5) L is a regular language, that is, it is decidable whether or not there exists a 
regular language R1 such that L = R1 (note that, since the proof of this property is 
constructive, one can effectively exhibit the finite automaton which accepts R1 [24]), 
(D6) L is prefix-free [8, page 355].


FACT 6.2.1. If L is known to be a context-free language (not a deterministic 
context-free language), then the above Problems (D1)—(D6) are all undecidable. 


Recently the following result has been proved [19]. 


(D7) It is decidable given any two deterministic context-free languages L1 and L2, 
to test whether or not L1 = L2. 


Note that, on the contrary, as we will see in Section 6.3, it is undecidable given
any two deterministic context-free languages L1 and L2, to test whether or not
L1 ⊆ L2.

Note also that it is undecidable given any two context-free grammars G1 and 
G2, to test whether or not L(G1) = L(G2) (see Section 6.1.1 starting on page 201). 


6.3. Undecidability in Deterministic Context-Free Languages 


The following properties are undecidable for deterministic context-free languages.
These properties are undecidable also for context-free languages in the sense that we
will indicate in Fact 6.3.1 below [9, page 247].

We assume that the terminal alphabet of the grammars and languages under
consideration is Σ with |Σ| ≥ 2.


Given any two deterministic context-free languages L1 and L2, it is undecidable
to test whether or not:
(U1) L1 ∩ L2 = ∅,
(U2) L1 ⊆ L2,
(U3) L1 ∩ L2 is a deterministic context-free language,
(U4) L1 ∪ L2 is a deterministic context-free language,
(U5) L1 · L2 is a deterministic context-free language, where · denotes concatenation
of languages,
(U6) L1* is a deterministic context-free language,
(U7) L1 ∩ L2 is a context-free language.


FACT 6.3.1. If the languages L1 and L2 are known to be context-free languages
(and it is not known whether or not they are deterministic context-free languages,
or L1 or L2 is a deterministic context-free language) and in (U3)-(U6) we keep the
word ‘deterministic’, then the above Problems (U1)-(U7) are still undecidable.


6.4. Undecidable Properties of Linear Context-Free Languages 


The results presented in this section refer to the linear context-free languages and
are taken from [9, pages 213-214]. The definition of linear context-free languages is
given on page 110 (see Definition 3.1.22).

We assume that the terminal alphabet of the grammars and languages under
consideration is Σ with |Σ| ≥ 2.
(U8) It is undecidable given a context-free language L, to test whether or not L is
a linear context-free language.


It is undecidable given a linear context-free language L, to test whether or not:
(U9) L is a regular language,
(U10) the complement of L is a context-free language,
(U11) the complement of L is a linear context-free language,
(U12) L is equal to Σ*.


(U13) It is undecidable given a linear context-free language L, to test that all linear
context-free grammars generating L are ambiguous grammars, that is, it is
undecidable given a linear context-free language L, to test whether or not for every linear
context-free grammar G generating L, there exists a word in L with two different
leftmost derivations according to G. (Obviously, L may be generated also by a
context-free grammar which is not linear.)

CHAPTER 7 


Appendices 


7.1. Iterated Counter Machines and Counter Machines 


In a pushdown automaton the alphabet Γ of the stack can be reduced to two symbols
without loss of computational power. However, if we allow one symbol only, we
lose computational power. In this section we will present the class of pushdown
automata, called counter machines, in which one symbol only is allowed in a cell
different from the bottom cell of the stack.

Actually, there are two kinds of counter machines: (i) the iterated counter ma- 
chines [8], and (ii) the counter machines, tout court. Note that, unfortunately, 
some textbooks (see, for instance, [9]) refer to iterated counter machines as counter 
machines. 

Let us begin by defining the iterated counter machines. 


DEFINITION 7.1.1. [Iterated Counter Machine, or (0?+1−1)-counter Machine]
An iterated counter machine, also called a (0?+1−1)-counter machine, is
a pda whose stack alphabet has two symbols only: Z_0 and A. Initially, the stack,
also called the iterated stack or iterated counter, holds only the symbol Z_0 at the
bottom. Z_0 may occur only at the bottom of the stack. All other cells of the stack
may have the symbol A only. An iterated counter machine allows on the stack the
following three operations only: (i) test-if-0, (ii) add 1, and (iii) subtract 1.


The operation ‘test-if-0’ tests whether or not the top of the stack is Z_0. The
operation ‘add 1’ pushes one A onto the stack, and the operation ‘subtract 1’ pops
one A from the stack.

For any n ≥ 0, we assume that the stack stores the value n by storing n symbols A
and the symbol Z_0 at the bottom. Before subtracting 1, one can test if the
value 0 is stored on the stack and this test avoids performing the popping of Z_0,
which would lead the iterated counter machine to a configuration with no successor
configurations because the stack is empty.


DEFINITION 7.1.2. [Counter Machine, or (+1−1)-counter Machine] A
counter machine, also called a (+1−1)-counter machine, is a pda whose stack
alphabet has one symbol only, and that symbol is A. Initially, the stack, also called
the counter, holds only one symbol A at the bottom. All cells of the stack may
have the symbol A only. A counter machine allows on the stack the following two
operations only: (i) ‘add 1’, and (ii) ‘subtract 1’.


The operation ‘add 1’ pushes one A onto the stack, and the operation ‘subtract 1’
pops one A from the stack. Before subtracting 1, one cannot test if after the
subtraction, the stack becomes empty. If the stack becomes empty, the counter
machine gets into a configuration which has no successor configurations.


Iterated counter machines and counter machines behave as usual pda’s as far as
the reading of the input tape is concerned. Thus, the transition function δ of any
iterated counter machine (or counter machine) is a function from Q × (Σ ∪ {ε}) × Γ
to the set of finite subsets of Q × Γ*. A move of an iterated counter machine (or
a counter machine) is made by: (i) reading a symbol or the empty string from the
input tape, (ii) popping the symbol which is the top of the stack (thus, the stack
should not be empty), (iii) changing the internal state, and (iv) pushing a symbol
or a string of symbols onto the stack.

As for pda’s, also for iterated counter machines (or counter machines) we assume
that when a move is made, the symbol on top of the iterated counter (or the counter)
is popped. Thus, for instance, the string σ ∈ Γ* which is the output of the transition
function δ of an iterated counter machine, is such that: (i) |σ| = 2 if we add 1,
(ii) |σ| = 1 if we test whether or not the top of the stack is Z_0, that is, we perform
the operation ‘test-if-0’, and (iii) |σ| = 0 (that is, σ = ε) if we subtract 1.


As the pda’s, also the iterated counter machines and the counter machines are 
assumed, by default, to be nondeterministic machines. However, for reasons of clar- 
ity, sometimes we will explicitly say ‘nondeterministic iterated counter machines’, 
instead of ‘iterated counter machines’, and analogously, ‘nondeterministic counter 
machines’, instead of ‘counter machines’. 

We have the following notions of deterministic iterated counter machines and 
deterministic counter machines. They are analogous to the notion of deterministic 
pda’s (see Definition 3.3.1 on page 117). 


DEFINITION 7.1.3. [Deterministic Iterated Counter Machine and
Deterministic Counter Machine] Let us consider an iterated counter machine (or a
counter machine) with the set Q of states, the input alphabet Σ, and the stack
alphabet Γ = {Z_0, A} (or {A}, respectively). We say that the iterated counter
machine (or the counter machine) is deterministic iff the transition function δ from
Q × (Σ ∪ {ε}) × Γ to the set of finite subsets of Q × Γ* satisfies the following two
conditions:


(i) ∀q ∈ Q, ∀Z ∈ Γ, if δ(q, ε, Z) ≠ {} then ∀a ∈ Σ, δ(q, a, Z) = {}, and
(ii) ∀q ∈ Q, ∀Z ∈ Γ, ∀x ∈ Σ ∪ {ε}, δ(q, x, Z) is either {} or a singleton.
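
Conditions (i) and (ii) can be checked mechanically on a finite transition table. In the following Python sketch the encoding is our own assumption: the transition function is a dictionary from triples (q, x, Z) to sets of pairs (state, pushed string), the empty string "" stands for ε, and a missing key stands for the empty set {}.

```python
def is_deterministic(delta, states, input_alphabet, stack_alphabet):
    """Check conditions (i) and (ii) of the definition of determinism."""
    for q in states:
        for Z in stack_alphabet:
            # (i) an epsilon-move excludes every move reading an input symbol
            if delta.get((q, "", Z)) and any(
                    delta.get((q, a, Z)) for a in input_alphabet):
                return False
            # (ii) at most one choice for each (q, x, Z)
            for x in list(input_alphabet) + [""]:
                if len(delta.get((q, x, Z), ())) > 1:
                    return False
    return True

Q = {"q0", "q1", "q2"}
# A hypothetical table with exactly one choice per triple.
one_choice = {("q0", "(", "A"): {("q1", "A")},
              ("q1", "(", "A"): {("q1", "AA")},
              ("q1", ")", "A"): {("q2", "")},
              ("q2", ")", "A"): {("q2", "")}}
# Two choices on the same triple: condition (ii) fails.
two_choices = dict(one_choice)
two_choices[("q1", ")", "A")] = {("q2", ""), ("q2", "A")}
# An epsilon-move together with a reading move: condition (i) fails.
eps_clash = dict(one_choice)
eps_clash[("q0", "", "A")] = {("q0", "A")}
```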


As for pda’s, acceptance of an iterated counter machine (or a counter machine)
M is defined by final state, in which case the accepted language is denoted by L(M),
or by empty stack, in which case the accepted language is denoted by N(M).


REMARK 7.1.4. Recall that, as for pda’s, acceptance of an input string by a 
nondeterministic (or deterministic) iterated counter machine (or counter machine) 
may take place only if the input string has been completely read (see Remark 3.1.9 
on page 103). 
FACT 7.1.5. [Equivalence of Acceptance by final state and by empty
stack for Nondeterministic Iterated Counter Machines] For each
nondeterministic iterated counter machine which accepts a language L by final state there
exists a nondeterministic iterated counter machine which accepts L by empty stack,
and vice versa [8, pages 147-148].


Thus, as it is the case for nondeterministic pda’s, the class of languages accepted 
by nondeterministic iterated counter machines by final state is the same as the class 
of languages accepted by nondeterministic iterated counter machines by empty stack 
(see Theorem 3.1.10 on page 103). 


NOTATION 7.1.6. [Transitions of Iterated Counter Machines and Counter
Machines] When we depict the transition function of an iterated counter machine
or a counter machine, we use the following notation (an analogous notation has been
introduced on page 107 for the pda’s). An edge from state A to state B of the form:


[Diagram: an arrow from state A to state B whose label shows x, y, and w.]


where x is the symbol read from the input, and y is the symbol on the top of the
stack, means that the machine may move from state A to state B by: (1) reading x
from the input, (2) popping y from the (iterated) counter, and (3) pushing w onto
the (iterated) counter so that the leftmost symbol of w becomes the new top of
the counter (actually, for counter machines we need not specify the new top of the
counter because only the symbol A can occur in the counter).


We have the following fact. 


FACT 7.1.7. (1) A deterministic counter machine accepts by empty stack the
one-parenthesis language, denoted L_P, generated by the grammar with axiom P
and productions:

P → () | (P)
(2) A nondeterministic iterated counter machine accepts by empty stack the iterated
one-parenthesis language, denoted L_R, generated by the grammar with axiom R and
productions:

R → () | (R) | RR
and there is no nondeterministic counter machine which accepts by empty stack the
language L_R.
(3) There is no nondeterministic iterated counter machine which accepts by empty
stack the iterated two-parenthesis language, denoted L_D, generated by the grammar
with axiom D and productions:

D → () | [] | (D) | [D] | DD


PROOF. Point (1) is shown by Figure 7.1.1 on page 210 where we have depicted
the deterministic counter machine which accepts by empty stack the language L_P.
That figure is depicted according to Notation 7.1.6 on page 209.

FIGURE 7.1.1. A deterministic counter machine which accepts by
empty stack the language generated by the grammar with axiom P
and productions P → () | (P). We assume that after pushing the
string b_1 ... b_n onto the stack, the new top symbol is b_1. The word
()) is not accepted because the second closed parenthesis is not read
when the stack is empty.
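
The behaviour described in the caption can be simulated with a short Python sketch. Since the drawing of the figure is not reproduced here, the transition table below is our own reconstruction and not necessarily the machine of Figure 7.1.1. In particular, we assume that the first open parenthesis is read by a move which pops A and pushes it back. The counter is represented by the number of A's it holds, and there are no ε-moves.

```python
def accepts_by_empty_stack(delta, start, word):
    """Run a deterministic counter machine without epsilon-moves.

    delta: dict (state, input symbol) -> (next state, number of A's pushed)
    Every move pops the top A, so the counter changes by pushed - 1.
    """
    state, counter, pos = start, 1, 0      # initially the counter holds one A
    while counter > 0 and pos < len(word):
        step = delta.get((state, word[pos]))
        if step is None:
            return False                   # no successor configuration
        state, pushed = step
        counter += pushed - 1
        pos += 1
    # accept iff the whole input has been read and the stack is empty
    return pos == len(word) and counter == 0

# Hypothetical transition table for the language of P -> () | (P).
delta_P = {
    ("q0", "("): ("q1", 1),    # first '(': pop A, push A
    ("q1", "("): ("q1", 2),    # further '(': add 1
    ("q1", ")"): ("q2", 0),    # first ')': subtract 1
    ("q2", ")"): ("q2", 0),    # further ')': subtract 1
}
```

On the word ()) the counter is empty after the first two symbols, so the third symbol cannot be read and the word is rejected, as stated in the caption.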




FIGURE 7.1.2. A nondeterministic iterated counter machine which
accepts by empty stack the language L_R. The nondeterminism is due
to the two arcs outgoing from q2. We assume that after pushing the
string b_1 ... b_n onto the stack, the new top symbol is b_1.


The first part of Point (2) is shown by Figure 7.1.2 on page 210 where we have
depicted a nondeterministic iterated counter machine which accepts by empty stack
the language L_R. That figure is depicted according to Notation 7.1.6 on page 209.

The second part of Point (2), that is, the fact that the language L_R cannot be
recognized by any nondeterministic counter machine by empty stack, follows from
the following facts.

Without loss of generality, we assume that for counting the open and closed 
parentheses occurring in the input word, we have to push exactly one symbol A 
onto the stack for each open parenthesis, and we have to pop exactly one symbol A 
from the stack for each closed parenthesis. 

When we have read the prefix of an input word in which the number of open 
parentheses is equal to the number of closed parentheses, the stack cannot be empty 
because, otherwise, the word ()() cannot be accepted (recall that when the stack is 
empty no move is possible). But if the stack is not empty, it should be made empty 
because the acceptance is by empty stack. Now, in order to make the stack empty, 
we must have at least two transitions of the following form (and this fact makes the
counter machine nondeterministic):


[Diagram: two transitions; the only label fragments which survive are ‘(, A’ and ‘AA’.]


FIGURE 7.1.3. A deterministic pda which accepts by final state the
language generated by the grammar with axiom D and productions
D → () | [] | (D) | [D] | DD. An arrow labeled by
‘p_1, p_2  p_1 p_2’ stands for the four arrows obtained for p_1 = ( or [,
and p_2 = ( or [. In the pair p_1, p_2 the symbol p_1 is the input character
and p_2 is the top of the stack. We assume that after pushing the string
p_1 p_2 onto the stack, the new top of the stack is p_1.


leaving from the state p which is reached when the prefix of the input word in L_R
has the number of open parentheses equal to the number of closed parentheses.
However, since: (i) the counter machine should pop one A from the stack for each
closed parenthesis, (ii) the counter machine cannot store the value of n, for any given
n, using a finite automaton, and (iii) the counter machine cannot know whether or
not the symbol at hand is the last closed parenthesis of a word of the form (^n )^n
(because by Point (ii), when it reads the symbol A on the top of the stack it cannot know
whether or not it is the only one left on the stack), the counter machine would accept
also any input word of the form (^n )^{n+1}, but such words should not be accepted.
Point (3) follows from the fact that in order to accept the two-parenthesis
language L_D, any nondeterministic iterated counter machine has to keep track of both
the number of the round parentheses and the number of square parentheses, and
this cannot be done by having one iterated counter only (see also Section 7.3
below). Indeed, a nondeterministic iterated counter machine with one iterated counter
cannot encode two numbers into one number only.


Note that the languages L_P, L_R, and L_D of Fact 7.1.7 on page 209 are
deterministic context-free languages, and they can be accepted by a deterministic pda by
final state. Figure 7.1.3 shows the deterministic pda which accepts by final state
the language L_D. That figure is depicted according to Notation 7.1.6 on page 209
and, in particular, we assume that when we push the string w onto the stack, the
new top is the leftmost symbol of w.
FIGURE 7.1.4. The stack of a nondeterministic counter machine can
be made empty starting from any final state f, by adding one extra
state f_1 and two extra ε-transitions which do not use any symbol of
the input: a first transition from f to f_1 and a second transition from
f_1 to f_1.


Now we present three facts which, together with Fact 7.1.5 on page 209, prove
the following relationships among classes of automata (these relationships are based
on the classes of languages which are accepted by the automata):


nondeterministic iterated counter machines with acceptance by final state
    =
nondeterministic iterated counter machines with acceptance by empty stack
    >
nondeterministic counter machines with acceptance by empty stack
    ≥
nondeterministic counter machines with acceptance by final state


In these relationships: (i) = means ‘same class of accepted languages’, (ii) > means
‘larger class of accepted languages’, and (iii) ≥ means > or =.


FACT 7.1.8. Nondeterministic iterated counter machines accept by empty stack
a class of languages which is strictly larger than the class of languages accepted by
empty stack by nondeterministic counter machines.


PROOF. This fact is a consequence of Fact 7.1.7 Point (2) on page 209. □


FACT 7.1.9. Nondeterministic counter machines accept by empty stack a class 
of languages which includes the class of languages accepted by final state by non- 
deterministic counter machines. 


PROOF. The proof is based on the fact that from every final state f we can
perform a sequence of ε-moves which makes the stack empty, as indicated in
Figure 7.1.4.


Fact 7.1.10. Nondeterministic iterated counter machines accept by final state 
a class of languages which is strictly larger than the class of languages accepted by 
final state by nondeterministic counter machines. 


PROOF. This fact is a consequence of Fact 7.1.5 on page 209, Fact 7.1.8 on 
page 212, and Fact 7.1.9 on page 212. 


FACT 7.1.11. [Acceptance by final state and by empty stack are Incom- 
parable for Deterministic Counter Machines] For deterministic counter ma- 
chines the class of languages accepted by final state is incomparable with the class 
of languages accepted by empty stack. 
FIGURE 7.1.5. (α) A deterministic counter machine which accepts
by final state the language A = {a^{2n} | n ≥ 1}. (β) A nondeterministic
counter machine which accepts by empty stack the language A. The
number of a’s which have been read from the input word is odd in the
states q_1 and p_1, while it is even in the states q_2 and p_2.


Indeed, we have the following Facts 7.1.12 and 7.1.13. 


FACT 7.1.12. (i) The language A = {a^{2n} | n ≥ 1} is accepted by final state by
a deterministic counter machine, and (ii) there is no deterministic counter machine
which accepts the language A by empty stack.


PROOF. (i) The language A is accepted by the deterministic counter machine
depicted in Figure 7.1.5 (α) (actually the language A is accepted by a finite
automaton).

(ii) This follows from the fact that when the stack is empty no move is possible.
Thus, it is impossible for a deterministic counter machine to accept aa and aaaa,
both belonging to A, and to reject aaa which does not belong to A. Note that
there exists a nondeterministic counter machine which accepts by empty stack the
language A as shown in Figure 7.1.5 (β). With reference to that figure we have that:
(i) the stack may be empty only in state p_2, (ii) in state p_1 an odd number of a’s
of the input word has been read, and (iii) in state p_2 an even number of a’s of the
input word has been read. Recall also that in order to accept an input word w, all
symbols of w should be read.
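
The role of nondeterminism in this argument can be checked with a small Python simulation. The drawing of Figure 7.1.5 (β) is not reproduced here, so the transition table below is our own reconstruction which satisfies the three properties listed in the proof. We assume that there are no ε-moves and that a move may pop A and push it back, and we represent the counter by the number of A's it holds.

```python
def nd_accepts_by_empty_stack(delta, start, word):
    """Simulate a nondeterministic counter machine without epsilon-moves
    by keeping the set of all reachable configurations (state, counter).

    delta: dict (state, input symbol) -> set of
           (next state, number of A's pushed after popping the top A)
    """
    configs = {(start, 1)}                 # initially the counter holds one A
    for c in word:
        configs = {(r, n + pushed - 1)
                   for (q, n) in configs if n > 0   # empty stack: no move
                   for (r, pushed) in delta.get((q, c), ())}
    # accept iff some run has read the whole word and emptied the stack
    return any(n == 0 for (_, n) in configs)

# Hypothetical table: p1 is reached after an odd number of a's, p2 after
# an even number, and the stack can become empty only on entering p2.
delta_A = {
    ("p0", "a"): {("p1", 1)},
    ("p1", "a"): {("p2", 1), ("p2", 0)},   # guess: is this the last a?
    ("p2", "a"): {("p1", 1)},
}
```

The guess made in state p1 is what a deterministic counter machine cannot do: once the stack has been emptied after aa, the rest of aaaa cannot be read.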


FACT 7.1.13. (i) The language B = {a^n b^n | n ≥ 1} is accepted by empty stack by
a deterministic counter machine, and (ii) there is no deterministic counter machine
which accepts the language B by final state.


PROOF. (i) The language B is accepted by empty stack by the deterministic 
counter machine depicted in Figure 7.1.6. This machine is obtained from that of 
Figure 7.1.1 by replacing the symbols ‘(’ and ‘)’ by the symbols a and b, respectively. 


(ii) This is a consequence of the following two points: (ii.1) a finite number of states 
cannot recall an unbounded number of a’s, and (ii.2) there is no way of testing that 
the number of a’s is equal to the number of b’s without making the stack empty, and 
if the stack is empty, no more moves can be made so as to enter a final state. 
FIGURE 7.1.6. A deterministic counter machine which accepts by 
empty stack the language $\{a^n b^n \mid n \geq 1\}$. 


Note that if in Figure 7.1.6 we make the state $q_3$ a final state, then 
words which are not of the form $a^n b^n$ are also accepted (for instance, the word aab is 
accepted). □ 
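
As an illustration of Point (i), here is a minimal Python sketch of a deterministic counter machine accepting B by empty stack. The state names and transitions are our own assumptions (Figure 7.1.6 is not reproduced here), and we assume that the counter initially holds one symbol.

```python
def accepts_B(word):
    """Deterministic one-counter acceptance of { a^n b^n | n >= 1 } by empty counter."""
    counter = 1          # assumption: the counter initially holds one symbol
    state = "q0"         # hypothetical state names
    for ch in word:
        if counter == 0:
            return False             # empty counter: no move is possible
        if state == "q0" and ch == "a":
            state = "q1"             # the initial symbol stands for the first a
        elif state == "q1" and ch == "a":
            counter += 1             # push one symbol for every further a
        elif state in ("q1", "q2") and ch == "b":
            counter -= 1             # pop one symbol for every b
            state = "q2"
        else:
            return False             # no transition is defined
    return counter == 0              # accept iff the input is read and the counter is empty

print([w for w in ["ab", "aabb", "aab", "abb", ""] if accepts_B(w)])
```

Note that the word aab is rejected here because one symbol is left on the counter, in agreement with the remark above on acceptance by final state.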


We have also the following fact. 


Fact 7.1.14. (i) The language $C = \{a^m b^n c \mid n \geq m \geq 1\}$ is accepted by empty 
stack by a deterministic iterated counter machine. 
(ii) There is no deterministic counter machine which can accept the language C by 
empty stack. 
(iii) The language C is accepted by empty stack by a nondeterministic counter 
machine. 


PROOF. (i) The language C is accepted by empty stack by the deterministic 
iterated counter machine depicted in Figure 7.1.7. 
Point (ii) follows from the fact that in order to count the number of b’s and make 
sure that n is greater than or equal to m, one has to leave the counter empty. Then 
no more moves can be made and the input symbol c cannot be read. 
Point (iii) is shown by the construction of the nondeterministic counter machine of 
Figure 7.1.8. That machine accepts the language C by empty stack. Indeed, given 
the input string $a^m b^n c$, with $n \geq m \geq 1$, after the sequence of m a’s, the counter of 
the counter machine of Figure 7.1.8 holds m+1 A’s. Then, by reading the n b’s 
from the input string, it can pop off the counter at most n A’s. Since $n \geq m$, there 
exists a sequence of moves which leaves exactly one A on the counter. This last A is 
popped when reading the last symbol c. Note that, for $n < m$, there is no sequence 
of moves which leaves exactly one A on the counter and thus, the transition due to 
the last symbol c cannot leave the counter empty. □ 
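
The counting argument of Point (iii) can be checked with a small Python sketch which explores all the nondeterministic choices at once. The encoding of the configurations as pairs (phase, k) is our own assumption; only the behaviour described in the proof (m+1 A’s after the a’s, optional pop on every b, final pop on c) is taken from the text.

```python
def accepts_C(word):
    """Nondeterministic counter machine for { a^m b^n c | n >= m >= 1 }, by empty counter."""
    configs = {("start", 1)}              # assumption: one A on the counter initially
    for ch in word:
        successors = set()
        for phase, k in configs:
            if k == 0:
                continue                  # empty counter: no move is possible
            if ch == "a" and phase in ("start", "a"):
                successors.add(("a", k + 1))    # push one A
            elif ch == "b" and phase in ("a", "b"):
                successors.add(("b", k))        # read b and leave the counter unchanged
                successors.add(("b", k - 1))    # read b and pop one A
            elif ch == "c" and phase == "b":
                successors.add(("end", k - 1))  # pop one A on the final c
        configs = successors
    return ("end", 0) in configs

print(accepts_C("abbc"), accepts_C("aabc"))
```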


We close this section by recalling the following two facts concerning the iterated 
counter machines, the counter machines, and the Turing Machines: 


(i) Turing Machines are as powerful as finite state automata with one-way input 
tape and two deterministic iterated counters, and 


(ii) Turing Machines are more powerful than finite state automata with one-way 
input tape and two deterministic counters. 
FIGURE 7.1.7. A deterministic iterated counter machine which accepts 
by empty stack the language $\{a^m b^n c \mid n \geq m \geq 1\}$. 

FIGURE 7.1.8. A nondeterministic counter machine which accepts by 
empty stack the language $\{a^m b^n c \mid n \geq m \geq 1\}$. The nondeterminism 
is due to the two loops from state $q_0$ to state $q_0$. 


7.2. Stack Automata 


In Chapter 3 we have considered the class of pushdown automata. In this section 
we will consider a related class of automata which are called stack automata [9, 25]. 
They are defined as follows. 


DEFINITION 7.2.1. [Stack Automaton or Stack Machine] A stack automa- 
ton or a stack machine (often abbreviated as SA, short for stack automaton) is a 
pushdown automaton with the following two additional features: (i) the read-only 
input tape is a two-way tape with left and right endmarkers, that is, the input head 
can move to the left and to the right, and (ii) the head of the stack can behave as 
for a pda, but it can also look at all the symbols in the stack in a read-only mode, 
without being forced to pop symbols off the stack. 


In the stack of an SA we have a bottom-marker which allows us to avoid reaching 
configurations which do not have successor configurations (like, for instance, those 
with an empty stack). 

When the stack head scans the top of the stack, an SA can either (i) push a 
symbol, or (ii) pop a symbol, or (iii) move down the stack without pushing or 
popping symbols. 

The class of deterministic SA’s is called DSA. The class of nondeterministic SA’s 
is called NSA. In our naming conventions we add the prefix ‘NE-’ for denoting that 
the stack automata are non-erasing, that is, they never pop symbols off the stack. 
For instance, NE-NSA is the class of the nondeterministic SA’s such that they never 
pop symbols off the stack. We also add the prefix ‘1-’ to denote that the input tape 
is one-way, that is, the head of the input tape moves to the right only. 
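
To illustrate the read-only access to the stack, the following Python sketch simulates a deterministic, one-way, non-erasing stack automaton for the language $\{a^n b^n c^n \mid n \geq 1\}$. This is a classical example which is not taken from this section, and the function name and the encoding of the stack as a list are our own assumptions.

```python
def accepts_anbncn(word):
    """One-way, non-erasing, deterministic stack automaton for { a^n b^n c^n | n >= 1 }."""
    stack = ["#"]                       # '#' plays the role of the bottom-marker
    i = 0
    while i < len(word) and word[i] == "a":
        stack.append("A")               # push one A for every a
        i += 1
    if len(stack) == 1:
        return False                    # at least one a is required
    head = len(stack) - 1               # the stack head starts at the top
    while i < len(word) and word[i] == "b":
        if stack[head] == "#":
            return False                # more b's than a's
        head -= 1                       # move down in read-only mode
        i += 1
    if stack[head] != "#":
        return False                    # fewer b's than a's
    while i < len(word) and word[i] == "c":
        if head == len(stack) - 1:
            return False                # more c's than a's
        head += 1                       # move up in read-only mode
        i += 1
    return i == len(word) and head == len(stack) - 1

print(accepts_anbncn("aabbcc"), accepts_anbncn("aabbc"))
```

No symbol is ever popped, and the input is scanned from left to right only, so the machine sketched here would belong to the class 1NE-DSA.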

In Figure 7.2.1 we have depicted the containment relationships among some 
classes of stack automata and some complexity classes. For these complexity classes 
the reader may refer to [9, Chapter 14]. In that figure an edge from class B (below) 
to class A (above) denotes that $B \subseteq A$. Some of the containments in that figure are 
proper. In particular, we have that the class of deterministic context-free languages, 
denoted DCF, is properly contained in the class of context-free languages, denoted 
CF, and the class of context-free languages is properly contained in the class of 
context-sensitive languages, denoted CS. (Recall that we assume that the empty 
word $\varepsilon$ may occur in the classes of languages DCF, CF, and CS.) 


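[Diagram of Figure 7.2.1 not recoverable from the source. Legible node labels: 
NE-NSA = NSPACE($n^2$), NE-DSA = DSPACE($n \log n$), NSPACE(n), DSPACE(n), NSA, 
1-DSA, 1NE-NSA, 1NE-DSA, and two unions of DTIME classes whose bounds are 
only partly legible.] 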


FIGURE 7.2.1. Relationships among some complexity classes for 
nondeterministic (NSA) and deterministic (DSA) stack automata and 
their subclasses. An arrow from class B to class A denotes that $B \subseteq A$. 
The prefix ‘NE-’ means that the automaton is non-erasing, that is, 
symbols are never popped off the stack. The prefix ‘1-’ means that 
the input tape is one-way, that is, the head of the input tape moves 
to the right only. DCF, CF, and CS are the classes of the 
deterministic context-free languages, the context-free languages, and the 
context-sensitive languages, respectively (see [9, page 393]). The 
arrows marked by ‘e’ show that the nondeterministic classes include the 
corresponding deterministic classes. 
From Figure 7.2.1 the reader can see that the ‘computational power’ of the 
nondeterministic machines is, in general, not smaller than the ‘computational power’ 
of the corresponding deterministic machines (see the edges marked by ‘e’). 

The classes of 1-NSA and 1NE-NSA are full AFL’s (see Section 7.6 starting on 
page 225). 


7.3. Relationships Among Various Classes of Automata 


In this section we summarize some basic results on equivalences and containments for 
various classes of automata. Some of these results have been already mentioned in 
previous sections of the book. Some other results may be found in [9]. Equivalences 
and containments will refer to the class of languages which are accepted by the 
automata. 

Let us begin by relating Turing Machines and finite automata with stacks or 
iterated counters or queues. 


Turing Machines of various kinds. 

> Turing Machines (with acceptance by final state) are equivalent to: (i) finite 
automata with two stacks, or (ii) finite automata with two deterministic iterated 
counters, or (iii) finite automata with one queue (these kinds of automata are called 
Post Machines) (see, for instance, [9]). 

> Nondeterministic Turing Machines are equivalent to deterministic Turing Ma- 
chines. 

> Off-line Turing Machines are equivalent to standard Turing Machines (that is, 
Turing Machines as introduced in Definition 5.0.1 on page 184). Off-line Turing 
Machines do not change their computational power if we assume that the input 
word on the input tape has both a left and a right endmarker, or a right endmarker 
only. Obviously, we may assume that the input word has no endmarkers if the input 
word is placed on the working tape, that is, we consider a standard Turing Machine. 
> Turing Machines with acceptance by final state are more powerful than 
nondeterministic pda’s with acceptance by final state. Nondeterministic pda’s are equivalent 
to finite automata with one stack only. 
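
The equivalence between Turing Machines and finite automata with two stacks rests on the fact that two stacks can represent a tape: one stack holds the cells to the left of the head and the other holds the scanned cell and the cells to its right. The following Python sketch shows this representation only (not a full machine). The class name `TwoStackTape` and the blank symbol `_` are our own assumptions.

```python
class TwoStackTape:
    """A Turing Machine tape represented by two stacks (Python lists)."""

    def __init__(self, word, blank="_"):
        self.blank = blank
        self.left = []                                 # cells to the left of the head
        self.right = list(reversed(word)) or [blank]   # top of this stack = scanned cell

    def read(self):
        return self.right[-1]

    def write(self, symbol):
        self.right[-1] = symbol

    def move_right(self):
        self.left.append(self.right.pop())
        if not self.right:
            self.right.append(self.blank)              # extend the tape with a blank

    def move_left(self):
        self.right.append(self.left.pop() if self.left else self.blank)

    def contents(self):
        return "".join(self.left) + "".join(reversed(self.right))

tape = TwoStackTape("abc")
tape.move_right()
tape.write("X")
tape.move_left()
print(tape.read(), tape.contents())
```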


Nondeterministic and deterministic pushdown automata. 
> Nondeterministic pda’s with acceptance by final state are equivalent to nondeter- 
ministic pda’s with acceptance by empty stack. 
> Nondeterministic pda’s with acceptance by final state are more powerful than 
deterministic pda’s with acceptance by final state. 

In particular, the language 

$N = \{a^k b^m \mid (m = k \text{ or } m = 2k) \text{ and } m, k \geq 1\}$ 
is a nondeterministic context-free language which can be accepted by final state by a 
nondeterministic pda, but it cannot be accepted by final state by any deterministic 
pda. A grammar which generates the language N has axiom S and the following 
productions: 


$S \to L \mid R$        $L \to a L b \mid a b$        $R \to a R b b \mid a b b$ 
FIGURE 7.3.1. A nondeterministic pda which accepts by final state 
the language N generated by the grammar with axiom S and the 
productions: $S \to L \mid R$, $L \to a L b \mid a b$, $R \to a R b b \mid a b b$. 


This grammar is unambiguous, that is, no word has two distinct parse trees (see 
Definition 3.12.1 on page 155). In Figure 7.3.1 we have depicted the nondeterministic 
pda which accepts by final state the language N. In that figure we used the same 
conventions as in Figures 7.1.1 and 7.1.3. In particular, when the string $b_1 \ldots b_n$ 
is pushed onto the stack, the new top symbol is the leftmost symbol $b_1$. 


Note that a pushdown automaton can be simulated by a finite automaton with 
two deterministic iterated counters, and a finite automaton with two deterministic 
iterated counters is equivalent to a Turing Machine. 


> Deterministic pda’s with acceptance by final state are more powerful than deter- 
ministic pda’s with acceptance by empty stack. 

> However, if we restrict ourselves to languages which enjoy the prefix property 
(see Definition 3.3.9 on page 120) then deterministic pda’s with acceptance by final 
state accept exactly the same class of languages which are accepted by deterministic 
pda’s with acceptance by empty stack. 
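
For the language N introduced above, the following Python sketch enumerates the words derived by the two groups of productions and checks membership by counting. It is an illustration of the language only: the function names are ours, and the membership test does not simulate the pda of Figure 7.3.1.

```python
def derive_N(max_k):
    """Words of N whose block of a's has length at most max_k."""
    words = set()
    for k in range(1, max_k + 1):
        words.add("a" * k + "b" * k)          # via L -> a L b | a b
        words.add("a" * k + "b" * (2 * k))    # via R -> a R b b | a b b
    return words

def in_N(word):
    """Membership test for N = { a^k b^m | (m = k or m = 2k) and m, k >= 1 }."""
    k = len(word) - len(word.lstrip("a"))     # number of leading a's
    m = len(word) - k
    return k >= 1 and word[k:] == "b" * m and m in (k, 2 * k)

print(sorted(derive_N(2), key=len))
```

A machine reading the word from left to right does not know, while reading the a’s, whether it will have to match k or 2k b’s. This is the choice that the nondeterministic pda guesses.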


Deterministic pushdown automata and deterministic counter machines 
with n counters. 
> For any n > 0, the class of the deterministic pda’s with acceptance by final state 
is incomparable with the class of the deterministic counter machines with n counters 
with acceptance by all n stacks empty. 

This result is proved by Points (A) and (B) below. The formal definition of 
a deterministic counter machine with n counters is derived from Definitions 7.1.2 
and 7.1.3 on pages 207 and 208, respectively, by allowing n counters, instead of one 
counter only. In each move of a deterministic counter machine with n counters the 
configuration of one or more counters may change simultaneously. A deterministic 
counter machine with n counters cannot make any move if all counters are empty 
or if it tries to perform an ‘add 1’ or a ‘subtract 1’ operation on a counter that is 
empty. 
Point (A). A language which is accepted by a deterministic pda by final state but 
is not accepted by any deterministic counter machine with n counters, for any 
n > 0, with acceptance by all n counters empty, is the iterated two-parenthesis 
language generated by the grammar with axiom D and the following productions 
(see Fact 7.1.7 on page 209 and Figure 7.1.3 on page 211): 


$D \to (\,) \mid [\,] \mid (D) \mid [D] \mid D D$ 


The proof of this fact is similar to the proof of Fact 7.1.7 Point (2) on page 210. In 
particular, we have that in order to accept a word with balanced parentheses of the 
form: $(^{h_1} [^{k_1} (^{h_2} [^{k_2} \ldots (^{h_n} [^{k_n} ]^{k_n} )^{h_n} \ldots ]^{k_2} )^{h_2} ]^{k_1} )^{h_1}$ we need at least the computational 
power of a deterministic counter machine with 2n counters. Recall also that the 
encoding of two numbers by one number only is not possible when we have counters, 
because it is not possible to test when a counter holds the value 0. 


Point (B). Now we present a language L which is not context-free (and thus, it can 
be accepted neither by a nondeterministic pda nor a deterministic pda) and it is 
accepted by a deterministic counter machine with two counters with acceptance by 
the two counters empty. 

Let us start by considering, for i = 1, 2, the parenthesis language $L_i$ generated 
by the context-free grammar with axiom $S_i$ and productions $S_i \to a_i S_i b_i \mid a_i b_i$. 
The symbol $a_i$ corresponds to an open parenthesis and the symbol $b_i$ corresponds 
to a closed parenthesis. Then, we consider the language L which is made out of the 
words each of which is an interleaving of a word of $L_1$ and a word of $L_2$. Recall that, 
for instance, the interleavings of the two words $w_1 = a_1 b_1$ and $w_2 = a_2 b_2$ are the 
following six words: 

$a_1 b_1 a_2 b_2$ (= $w_1 w_2$), $a_1 a_2 b_1 b_2$, $a_1 a_2 b_2 b_1$, $a_2 a_1 b_1 b_2$, $a_2 a_1 b_2 b_1$, $a_2 b_2 a_1 b_1$ (= $w_2 w_1$). 
Now we have that L is not a context-free language. Indeed, let us assume, by 
contradiction, that L is context-free. Then, the intersection of L with the regular 
language $a_1^* a_2^* b_1^* b_2^*$ would be context-free. But this is not the case (see language L4 
on page 152). We leave it to the reader to show that L is accepted by a deterministic 
counter machine with two counters with acceptance by the two counters empty. 
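
The six interleavings listed above can be recomputed with the following Python sketch. Symbols are represented as strings such as `"a1"` and words as tuples of symbols. This encoding and the function name are our own choices.

```python
def interleavings(u, v):
    """All interleavings of the symbol sequences u and v, as a set of tuples."""
    if not u:
        return {tuple(v)}
    if not v:
        return {tuple(u)}
    return ({(u[0],) + w for w in interleavings(u[1:], v)} |
            {(v[0],) + w for w in interleavings(u, v[1:])})

w1 = ("a1", "b1")
w2 = ("a2", "b2")
result = interleavings(w1, w2)
for w in sorted(result):
    print(" ".join(w))
```

Each interleaving keeps the relative order of the symbols of $w_1$ and of the symbols of $w_2$, which is why two independent counters, one per kind of parenthesis, suffice to recognize the words of L.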


Hierarchy of deterministic counter machines with n counters, for n>1. 

> For any n>1, deterministic counter machines with n+1 counters with acceptance 
by all n+1 counters empty, are more powerful than deterministic counter machines 
with n counters with acceptance by all n counters empty. 

More formally, for all n > 1, for every deterministic counter machine M with n 
counters which accepts a language L with acceptance by all counters empty, there 
exists a deterministic counter machine M’ with n+1 counters which accepts L with 
acceptance by all counters empty. 

This result can be established as follows. First, note that the machine M should 
make at least one move in which it makes its n counters empty and, thus, accepts a 
word in L. The first move of the machine M’ is equal to the first move of M, except 
that in that move M’ also makes its (n+1)-st counter empty. Then the machine M’ 
proceeds by making the same sequence of moves made by the machine M. 

Now, for any n > 1, we present a language $L_n$ which can be accepted by a 
deterministic counter machine with m counters, with $m \geq n$, with acceptance by all 
counters empty, but it cannot be accepted by a deterministic counter machine with 
a number of counters smaller than n, with acceptance by all counters empty. The 

FIGURE 7.3.2. A deterministic counter machine with two counters 
which accepts the parenthesis language $L(P_2)$ by the two counters 
empty. The productions for $P_2$ are: $P_2 \to a_2 P_2 b_2 \mid a_2 P_1 b_2$ 
and $P_1 \to a_1 P_1 b_1 \mid a_1 b_1$. For i = 1, 2, by $A_i$ we denote the symbol 
A on the counter i. 


language $L_n$ is the language ‘with n different kinds of parentheses’ generated by the 
grammar with axiom $P_n$ and the following productions: 

$P_n \to a_n P_n b_n \mid a_n P_{n-1} b_n$   $\ldots$   $P_2 \to a_2 P_2 b_2 \mid a_2 P_1 b_2$   $P_1 \to a_1 P_1 b_1 \mid a_1 b_1$ 
For i = 1, ..., n, the symbol $a_i$ corresponds to the open parenthesis of kind i and 
the symbol $b_i$ corresponds to the closed parenthesis of kind i. The counters 1, ..., n 
are used by the accepting machine for counting the numbers of the $a_1$’s, ..., $a_n$’s, 
respectively, while the counters n+1, ..., m are made empty on the first move and 
never used henceforth (see also Figure 7.3.2). 

For every n>1, $L_n$ is a deterministic context-free language, and it is accepted 
with acceptance by empty stack by a deterministic pda. That deterministic pda, 
whose construction is left to the reader, can be derived from the one of Figure 7.1.3 
on page 211 by making some minor modifications and considering n kinds of paren- 
theses, instead of the square parentheses and the round parentheses only. 
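
According to the productions for $P_n$, every word of $L_n$ consists of non-empty blocks of $a_n$’s, ..., $a_1$’s followed by blocks of $b_1$’s, ..., $b_n$’s of matching lengths. The following Python sketch checks this shape using one integer counter per kind of parenthesis. It is a direct check, not a simulation of the machine of Figure 7.3.2, and the encoding of the symbol $a_i$ as the pair `("a", i)` is our own assumption.

```python
def accepts_Ln(word, n):
    """word is a list of pairs (letter, kind), for instance ("a", 2) for a_2."""
    counters = [0] * (n + 1)            # counters[i] = number of unmatched a_i's
    pos = 0
    for kind in range(n, 0, -1):        # opening blocks a_n, ..., a_1
        while pos < len(word) and word[pos] == ("a", kind):
            counters[kind] += 1
            pos += 1
        if counters[kind] == 0:
            return False                # every block must be non-empty
    for kind in range(1, n + 1):        # closing blocks b_1, ..., b_n
        while (pos < len(word) and word[pos] == ("b", kind)
               and counters[kind] > 0):
            counters[kind] -= 1
            pos += 1
        if counters[kind] != 0:
            return False                # too few b's of this kind
    return pos == len(word)             # no symbol may be left unread

good = [("a", 2), ("a", 2), ("a", 1), ("b", 1), ("b", 2), ("b", 2)]
print(accepts_Ln(good, 2))
```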


Deterministic iterated counter machines with one iterated counter and 
deterministic counter machines with one counter. 
> Deterministic iterated counter machines with one iterated counter (see Defini- 
tion 7.1.1 on page 207) with acceptance by final state are more powerful than deter- 
ministic counter machines with one counter (see Definition 7.1.2 on page 207) with 
acceptance by empty stack. 

In particular, we have that the language 

$E = \{w \in \{0,1\}^* \mid$ equal number of occurrences of 0’s and 1’s in $w\}$ 
is accepted by a deterministic iterated counter machine with acceptance by final 
state (see Figure 7.3.3 where we used Notation 7.1.6 on page 209), but it cannot be 
accepted by a deterministic counter machine with acceptance by empty stack. This 
result is due to the fact that there is a word $w \in E$ such that $w0 \notin E$ and $w01 \in E$. 
For that input word w, in fact, the counter should become empty, but then no move 
can be made for accepting w01 (recall that for accepting an input word, that word 
should be completely read). 
FIGURE 7.3.3. A deterministic iterated counter machine with one 
iterated counter which accepts by final state (when the input string is 
completely read) the language $\{w \mid w \in \{0,1\}^*$ and in w the number 
of 0’s is equal to the number of 1’s$\}$. 


With reference to Figure 7.3.3 recall that $Z_0$ is initially at the bottom of the 
iterated counter, and in any other cell of the iterated counter only the symbol A 
may occur. In state 1, if there is a character 1 in input then we add 1 to the iterated 
counter (that is, we push one A), and if there is a character 0 in input then we 
subtract 1 from the iterated counter (that is, we pop one A). Similarly, in state 0, 
if there is a character 0 in input then we add 1 to the iterated counter (that is, we 
push one A), and if there is a character 1 in input then we subtract 1 from the 
iterated counter (that is, we pop one A). When the string $b_1 b_2$ is pushed on the 
iterated counter, the new top symbol is $b_1$. 
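
The behaviour just described can be sketched in Python as follows. The variable `count` stands for the number of A’s above $Z_0$ and `state` records which character is currently in excess. The way the state is chosen when only $Z_0$ is on the counter is our own assumption, since Figure 7.3.3 is not reproduced here.

```python
def accepts_E(word):
    """Iterated counter machine for words with as many 0's as 1's (final state acceptance)."""
    count = 0            # number of A's above the bottom symbol Z0
    state = None         # "0" or "1": the character which is currently in excess
    for ch in word:
        if ch not in "01":
            return False
        if count == 0:
            state, count = ch, 1     # only Z0 on the counter: start counting ch
        elif ch == state:
            count += 1               # add 1 (push one A)
        else:
            count -= 1               # subtract 1 (pop one A)
    return count == 0                # accept iff only Z0 is left when the input ends

print([w for w in ["", "01", "0110", "001", "1100"] if accepts_E(w)])
```

Unlike a counter machine, the iterated counter machine can keep moving after its counter has returned to $Z_0$, which is what allows it to accept both w and w01.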

Note that, if we consider the language E $, instead of E (that is, we consider an 
endmarker for each input word), then we can accept E $ by a deterministic counter 
machine by empty stack. In that case, in fact, we can store in the counter one 
extra symbol A which we will pop only when the symbol $ is read from the input. 
Obviously, the language E $ can be accepted by empty stack also by a deterministic 
iterated counter machine. We leave it to the reader to construct that iterated counter 
machine. 


7.4. Decidable Properties of Classes of Languages 


In Table 2 on page 222 we summarize some decidable and undecidable properties of 
various classes of languages and grammars in the Chomsky Hierarchy. In this table 
REG, DCF, CF, CS, and Type 0, denote the classes of regular languages, deter- 
ministic context-free languages, context-free languages, context-sensitive languages, 
and Type 0 languages, respectively. We assume that REG, CF, and CS also denote 
the classes of grammars corresponding to those classes of languages. 

For Problems (a)—(g) of Table 2, the input language L(G) of the class REG (or 
CF, or CS, or Type 0) is given by a grammar of the class REG (or CF, or CS, or 
Type 0, respectively). The input language L(G) of the class DCF is given, as we said 
on page 169, by providing either (i) the instructions of a deterministic pda which 
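[Table 2 is only partially recoverable from the source. Its rows are the 
Problems (a)-(h) and its columns are the classes REG, DCF, CF, CS, and Type 0 
of the language L(G). Legible fragments, with lost entries shown as ‘...’: 

(b) Is L(G) empty ? Is L(G) finite ?     ...   U (8)   U (10) 
(c) Is L(G) = $\Sigma^*$ ? (1)                    ...   U   U 
(e) Is L(G) context-free ?               ...   S (yes)   U (13)   ... 
(f) Is L(G) regular ?                    ...   U   U 
(g) Is L(G) inherently ambiguous ?       ...   U   U 
(h) Is grammar G ambiguous ?             ...   U   U] 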


TABLE 2. Decidability and undecidability of problems for various 
classes of languages and grammars. REG, DCF, CF, CS, and Type 0 
stand for regular, deterministic context-free, context-free, 
context-sensitive, and type 0, respectively. S, S (yes), and S (no) mean 
solvable, solvable with answer ‘yes’, and solvable with answer ‘no’, 
respectively. U means unsolvable. Entries in positions (1)-(13) are 
explained in Remarks (1)-(13), respectively, starting on page 222. 


accepts it, or (ii) a context-free grammar which is an LR(k) grammar, for some 
$k \geq 1$ [15, Section 5.1]. Recall also that any deterministic context-free language can 
be generated by an LR(1) grammar [9, pages 260-261]. 

For Problem (h) of Table 2 the input grammar G of the class DCF is given by 
providing an LR(k) grammar, for some k> 1 [9, Section 10.8] (see also Remark (12) 
on page 223). 

For the results shown in Table 2, except for those concerning Problem (c): 
«L(G) = $\Sigma^*$?», it is not relevant whether or not the empty word $\varepsilon$ is allowed 
in the classes of languages REG, DCF, CF, and CS (see also Remark (1) below). 

An entry S in Table 2 means that the problem is solvable. An entry S (yes) means 
that the problem is solvable and the answer is ‘yes’. Likewise for the answer ‘no’. 
An entry U means that the problem is unsolvable. 

Note that the two problems: (i) «Is L(G) finite?» and (ii) «Is L(G) infinite ?» 
have the same decidability properties for the classes of languages REG, DCF, CF, 
CS, Type 0, that is, either they are both decidable or they are both undecidable. 


Now we make some remarks on the entries of Table 2 on page 222. 
REMARK (1). The problem «L(G) = $\Sigma^*$?» is trivial for the classes of languages 
REG, DCF, CF, and CS, if we assume that those languages cannot include the 
empty string $\varepsilon$. However, here we assume that: (i) the languages in REG, DCF, 
and CF are generated by extended grammars, that is, grammars that may have extra 
productions of the form $A \to \varepsilon$, and (ii) the languages in the class CS are generated 
by grammars that may have the production $S \to \varepsilon$ with the start symbol S which 
does not occur in the right hand side of any production. With these hypotheses, 
the problem of checking whether or not L(G) = $\Sigma^*$ is not trivial and it is solvable 
or unsolvable as shown in Table 2. The problem «L(G) = $\Sigma^+$?» will have entries 
equal to the ones listed in Table 2 for the problem «L(G) = $\Sigma^*$?» if we assume 
that REG, DCF, CF, and CS denote classes of languages which are generated by 
grammars without any production whose right hand side is $\varepsilon$. 

REMARK (2). This problem can be solved by constructing the finite automaton 
which is equivalent to the given grammar. 

REMARK (3). Having constructed the finite automaton F corresponding to the given 
grammar G, we have that: (i) L(G) is empty iff there are no final states in F, 
(ii) L(G) is finite iff there are no paths from a state to itself in F. 

REMARK (4). Having constructed the minimal finite automaton M corresponding 
to the given grammar G, we have that L(G) is equal to $\Sigma^*$ iff M has one state only 
and for each symbol in $\Sigma$ there is an arc from that state to itself. 

REMARK (5). This problem has been shown to be solvable in [19]. Note that for 
deterministic context-free languages the problem «$L1 \subseteq L2$?» is unsolvable (see 
Property (U2) on page 205). Recall that a deterministic context-free language can be 
given either by an LR(1) grammar that generates it, or by a deterministic pushdown 
automaton that recognizes it. 

REMARK (6). The problem of determining, given a context-free grammar G, whether 
or not L(G) = $\Sigma^*$ is undecidable (see Section 6.1.1). 

REMARK (7). The problem of determining, given two context-free grammars G1 and 
G2, whether or not L(G1) = L(G2) is undecidable (see Section 6.1.1). 

REMARK (8). The problem of determining whether or not a context-sensitive 
grammar generates an empty language is undecidable [9, page 230], and the problem 
of determining whether or not a context-sensitive grammar generates a 
finite language is also undecidable [8, page 295]. 

REMARK (9). The membership problem for a type 0 grammar is a $\Sigma_1$-problem of 
the Arithmetical Hierarchy (see, for instance, [14, 18]). 

REMARK (10). The problem of deciding whether or not, given a type 0 grammar G, 
the language L(G) is empty is a $\Pi_1$-problem of the Arithmetical Hierarchy (see, for 
instance, [14, 18]). 

REMARK (11). This problem is solvable because from a given right linear (or left 
linear) regular grammar G we may construct a (possibly nondeterministic) finite 
automaton F using Algorithm 2.2.2 on page 34 (or Algorithm 2.4.7 on page 42, 
respectively). Then G is ambiguous iff F is not a deterministic finite automaton. 
REMARK (12). For all $k \geq 1$ (and, in particular, also for k = 1), the problem of 
deciding, given an LR(k) grammar G, whether or not G is ambiguous, is trivially 
solvable (with answer ‘no’) [9, page 261]. 
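
The tests on finite automata mentioned in Remark (3) can be sketched in Python as follows. We assume an automaton without $\varepsilon$-moves, given by a dictionary `delta` from pairs (state, symbol) to sets of states. We also assume that the conditions of Remark (3) are read on the useful states only, that is, on the states which are reachable from the initial state and from which a final state can be reached. All names are ours.

```python
def arcs_of(delta):
    """The set of arcs (p, r) of the transition graph, forgetting the labels."""
    return {(p, r) for (p, _symbol), targets in delta.items() for r in targets}

def closure(sources, arcs):
    """States reachable from the given sources along the given arcs."""
    seen, todo = set(sources), list(sources)
    while todo:
        q = todo.pop()
        for (p, r) in arcs:
            if p == q and r not in seen:
                seen.add(r)
                todo.append(r)
    return seen

def is_empty(start, finals, delta):
    return not (closure({start}, arcs_of(delta)) & set(finals))

def is_finite(start, finals, delta):
    arcs = arcs_of(delta)
    backward = {(r, p) for (p, r) in arcs}
    useful = closure({start}, arcs) & closure(set(finals), backward)
    arcs = {(p, r) for (p, r) in arcs if p in useful and r in useful}
    alive = set(useful)
    changed = True
    while changed:                      # repeatedly delete states with no outgoing arc
        changed = False
        for q in list(alive):
            if not any(p == q and r in alive for (p, r) in arcs):
                alive.discard(q)
                changed = True
    return not alive                    # some state survives iff there is a cycle

star = {(0, "a"): {0}}                                      # accepts a*
single = {(0, "a"): {1}, (1, "b"): {2}, (2, "c"): {3}, (3, "c"): {3}}
print(is_finite(0, {0}, star), is_finite(0, {2}, single))
```

In the second automaton the loop on state 3 does not make the language infinite, because no final state can be reached from state 3.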
[Table 3: the entries are not recoverable from the source. Columns: class of 
languages; closed under; not closed under; Boolean Algebra? Rows: Type 0, 
Context-Sensitive, Context-Free, Deterministic Context-Free, Regular.] 


TABLE 3. Algebraic and closure properties for various classes of lan- 
guages. The operations indicated in this table are explained in Sec- 
tion 7.5. Entries in positions (1)—(5) are explained in Remarks (1)—(5), 
respectively, starting on page 224. 


REMARK (13). The problem of determining whether or not a context-sensitive 
grammar generates a context-free language is undecidable [2, page 208]. 


7.5. Algebraic and Closure Properties of Classes of Languages 


Table 3 on page 224 shows some algebraic and closure properties of some classes of 
languages of the Chomsky Hierarchy. 

The operations $\cdot$, $^*$, and $\neg$ on languages have been defined in Section 1.1 
starting on page 9. The operations $\cup$ and $\cap$ on languages are defined as the union 
and intersection operations on sets. The operation rev has been defined in 
Definition 2.12.3 on page 95. Note, in particular, that the classes of regular languages and 
context-sensitive languages are Boolean Algebras, if we interpret in a set theoretical 
sense the boolean operations lub, glb, complement, 0, and 1, that is, if we interpret 
them as union, intersection, $\lambda x.\, \Sigma^* - x$, $\emptyset$, and $\Sigma^*$, respectively. 

We assume that the empty word $\varepsilon$ can be an element of the Regular, 
Deterministic Context-Free, Context-Free, and Context-Sensitive languages. 

Now we make some remarks on the entries of Table 3. 

REMARK (1). If we assume that the empty word $\varepsilon$ is not an element of any 
context-sensitive language (the assumption that $\varepsilon$ is not an element of any context-sensitive 
language is also made in [9, page 271]) then for context-sensitive languages the Kleene 
closure, denoted by $^*$, should be replaced by the positive closure, denoted by $^+$. 

REMARK (2). The proof of the fact that the class of context-sensitive languages is 
closed under $\neg$ is in [10, 21]. 

REMARK (3). The fact that the class of the deterministic context-free languages is 
closed under $\neg$ is stated in Theorem 3.17.1 on page 169. 

REMARK (4). By the Post Theorem, if a set A and its complement $\Sigma^* - A$ (with 
respect to $\Sigma^*$ for some given alphabet $\Sigma$) are both r.e., then A and $\Sigma^* - A$ are both 
recursive. Since there exist r.e. languages which are not recursive, the class of 
type 0 languages, which is the class of r.e. languages, is not closed under $\neg$. 

REMARK (5). Both $\{a^i b^i c^j \mid i \geq 1, j \geq 1\}$ and $\{a^i b^j c^j \mid i \geq 1, j \geq 1\}$ are context-free 
and their intersection is $\{a^i b^i c^i \mid i > 0\}$ which is not context-free. The complement 
of a context-free language is, in general, a context-sensitive language. 
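
The intersection stated in Remark (5) can be checked on all short words with the following Python sketch. The names `in_L1` and `in_L2` for the membership tests of the two context-free languages are ours.

```python
import re
from itertools import product

def blocks(word):
    """Lengths of the blocks of a's, b's and c's, or None if the shape is wrong."""
    match = re.fullmatch(r"(a+)(b+)(c+)", word)
    return tuple(len(g) for g in match.groups()) if match else None

def in_L1(word):                 # { a^i b^i c^j | i >= 1, j >= 1 }
    b = blocks(word)
    return b is not None and b[0] == b[1]

def in_L2(word):                 # { a^i b^j c^j | i >= 1, j >= 1 }
    b = blocks(word)
    return b is not None and b[1] == b[2]

words = ["".join(p) for n in range(1, 7) for p in product("abc", repeat=n)]
both = sorted(w for w in words if in_L1(w) and in_L2(w))
print(both)
```

Among the words of length at most 6, only abc and aabbcc belong to both languages, as expected for $\{a^i b^i c^i \mid i > 0\}$.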


7.6. Abstract Families of Languages 


In this section we deal with classes of languages defined by the closure properties 
they enjoy. All languages we consider in this section are over some alphabet which 
is assumed to be finite. 

The interested reader is encouraged to look at [9, Chapter 11] for further infor- 
mation and results on this subject. 

The following definition introduces four classes of languages, namely, 
(i) the trio’s, (ii) the full trio’s, (iii) the AFL’s, and (iv) the full AFL’s. 
The reader will find the notions of homomorphism and $\varepsilon$-free homomorphism in 
Definition 1.7.2 on page 27, and the notion of inverse homomorphism in Definition 1.7.4 
on page 28. 


DEFINITION 7.6.1. [Trio, Full Trio, AFL, full AFL] (i) A trio is a set of 
languages which is closed under $\varepsilon$-free homomorphism, inverse homomorphism, and 
intersection with regular languages. 

(ii) A full trio is a set of languages which is closed under homomorphism, inverse 
homomorphism, and intersection with regular languages. 

(iii) An Abstract Family of Languages (or AFL, for short) is a set of languages which 
is a trio and it is also closed under concatenation, union, and $^+$ closure. 

(iv) A full Abstract Family of Languages (or full AFL, for short) is a set of languages 
which is a full trio and it is also closed under concatenation, union, and $^*$ closure. 


Obviously, the closure under homomorphism and the $^*$ closure extend the closure 
under $\varepsilon$-free homomorphism and the $^+$ closure, respectively. One can show that: 
(i) the set of all regular languages each of which does not include the empty word $\varepsilon$, 
is the smallest trio and also the smallest AFL [9, pages 270 and 278], and 
(ii) the set of all regular languages each of which may also include the empty word $\varepsilon$, 
is the smallest full trio and also the smallest full AFL [9, pages 270 and 278]. 
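
To make the difference between homomorphism and $\varepsilon$-free homomorphism concrete, the following Python sketch applies a homomorphism given as a dictionary from symbols to words, and computes a bounded portion of an inverse homomorphic image. The example homomorphism `h` and all the function names are hypothetical, and only words up to a given length are enumerated, since the full inverse image may be infinite.

```python
from itertools import product

def apply_hom(h, word):
    """Image of a word under the homomorphism h (a dict from symbols to words)."""
    return "".join(h[symbol] for symbol in word)

def is_epsilon_free(h):
    """True iff no symbol is mapped to the empty word."""
    return all(image != "" for image in h.values())

def inverse_hom(h, language, max_len):
    """Words w of length <= max_len over the domain of h such that h(w) is in language."""
    alphabet = sorted(h)
    return {"".join(p)
            for n in range(max_len + 1)
            for p in product(alphabet, repeat=n)
            if apply_hom(h, "".join(p)) in language}

h = {"0": "ab", "1": ""}             # hypothetical homomorphism, not epsilon-free
language = {"ab", "abab"}
inverse = inverse_hom(h, language, 3)
print(is_epsilon_free(h), sorted(inverse))
```

Since h erases the symbol 1, every word with one or two occurrences of 0 is mapped into the given language, whatever the number of its 1’s.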


Now we will give the definitions which introduce three closure properties. These 
definitions are parametric with respect to the choice of two, not necessarily distinct, 
finite alphabets. Let us call them A and B. 


Let $\mathrm{REG}_A$ be the set of regular languages, each of which is a subset of A*, and 
let C be a family of languages, each of which is a subset of B*. 

Given any language $R \in \mathrm{REG}_A$ and any substitution $\sigma$ from A to C (see 
Definition 1.7.1 on page 27), that is, for all $a \in A$, $\sigma(a)$ is a language in C, let us consider 
the following language, which is a subset of B* (not necessarily in C): 


$L_{R,\sigma} = \{w \mid n \geq 0 \text{ and } a_1 \ldots a_n \in R \text{ and } w \in \sigma(a_1) \cdot \ldots \cdot \sigma(a_n)\}$    (L1) 
where $\cdot$ denotes language concatenation. 
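
The construction of the language $L_{R,\sigma}$ can be illustrated on finite languages with the following Python sketch. Here finite sets of strings stand in for the regular language R and for the languages $\sigma(a)$. This restriction to finite sets, as well as the names used, are our own assumptions.

```python
from itertools import product

def substitute(language, sigma):
    """Finite analogue of L_{R,sigma}: every symbol a is replaced by a word of sigma(a)."""
    result = set()
    for word in language:
        choices = [sorted(sigma[a]) for a in word]
        for pick in product(*choices):     # one word of sigma(a_i) for each position i
            result.add("".join(pick))
    return result

R = {"", "ab", "ba"}
sigma = {"a": {"0", "00"}, "b": {"1"}}
print(sorted(substitute(R, sigma)))
```

The empty word of R corresponds to the case n = 0 and contributes the empty word to the result.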
Then we consider the class $\mathrm{SinR}(\mathrm{REG}_A, C)$ of all languages of the form $L_{R,\sigma}$, 
for every possible choice of the regular language $R \in \mathrm{REG}_A$ and the substitution $\sigma$ 
from A to C. Formally, we have that: 


$\mathrm{SinR}(\mathrm{REG}_A, C) = \{L_{R,\sigma} \mid R \in \mathrm{REG}_A \text{ and for all } a \in A,\ \sigma(a) \in C\}$ 


DEFINITION 7.6.2. [Closure under SinR and $\varepsilon$-freeR-SinR] (i) A class C 
of languages is said to be closed under substitution into the regular languages of 
$\mathrm{REG}_A$ (or SinR, for short) iff $\mathrm{SinR}(\mathrm{REG}_A, C) \subseteq C$. 

(ii) A class C of languages is said to be closed under substitution into the $\varepsilon$-free 
regular languages of $\mathrm{REG}_A$ (or $\varepsilon$-freeR-SinR, for short) iff (i) $\mathrm{SinR}(\mathrm{REG}_A, C) \subseteq C$ 
and (ii) when constructing the languages $L_{R,\sigma}$ (see Definition (L1) above) we assume 
that for all $R \in \mathrm{REG}_A$, we have that $\varepsilon \notin R$. 


Let C be a family of languages each of which is a subset of A*. 

Given any language $D \in C$ and any substitution $\sigma$ from A to $\mathrm{REG}_A$, that is, for 
all $a \in A$, $\sigma(a)$ is a language in $\mathrm{REG}_A$, let us consider the following language, which 
is a subset of A* (not necessarily in C): 

$L_{D,\sigma} = \{w \mid n \geq 0 \text{ and } a_1 \ldots a_n \in D \text{ and } w \in \sigma(a_1) \cdot \ldots \cdot \sigma(a_n)\}$    (L2) 
where $\cdot$ denotes language concatenation. 

Then we consider the class $\mathrm{SbyR}(C, \mathrm{REG}_A)$ of all languages of the form $L_{D,\sigma}$, 
for every possible choice of the language $D \in C$ and the substitution $\sigma$ from A 
to $\mathrm{REG}_A$. Formally, we have that: 


$\mathrm{SbyR}(C, \mathrm{REG}_A) = \{L_{D,\sigma} \mid D \in C \text{ and for all } a \in A,\ \sigma(a) \in \mathrm{REG}_A\}$ 


DEFINITION 7.6.3. [Closure under SbyR and $\varepsilon$-freeR-SbyR] (i) A class 
C of languages is said to be closed under substitution by the regular languages of 
$\mathrm{REG}_A$ (or SbyR, for short) iff $\mathrm{SbyR}(C, \mathrm{REG}_A) \subseteq C$. 
(ii) A class C of languages is said to be closed under substitution by the $\varepsilon$-free regular 
languages of $\mathrm{REG}_A$ (or $\varepsilon$-freeR-SbyR, for short) iff (i) $\mathrm{SbyR}(C, \mathrm{REG}_A) \subseteq C$ and 
(ii) when constructing the languages of the form $L_{D,\sigma}$ (see Definition (L2) above) 
we assume that for all $a \in A$, the empty word $\varepsilon$ does not belong to the regular 
language $\sigma(a) \in \mathrm{REG}_A$. 


In the following Definition 7.6.4 we present the closure property under substitution.
As the reader may verify, Definition 7.6.4 can be obtained from Definition 7.6.2
by replacing $REG_A$ by C, that is, by considering R to be a language in C, instead
of a regular language in $REG_A$. Equivalently, Definition 7.6.4 can be obtained
from Definition 7.6.3 by replacing $REG_A$ by C.


Let C be a family of languages, each of which is a subset of A*. 

Given any language $D \in C$ and any substitution $\sigma$ from $A$ to $C$, that is, for all
$a \in A$, $\sigma(a)$ is a language in C, let us consider the following language, which is a
subset of $A^*$ (not necessarily in C):


$L_{D,\sigma} = \{w \mid n \geq 0 \text{ and } a_1 \ldots a_n \in D \text{ and } w \in \sigma(a_1) \cdot \ldots \cdot \sigma(a_n)\}$

(a)       | (b)                                   | (c)                        | (d)
trio      | ε-free-h   $h^{-1}$   $\cap R$               | ε-free-GSM   $GSM^{-1}$    | ε-freeR-SbyR
full trio | h   $h^{-1}$   $\cap R$                      | GSM   $GSM^{-1}$           | SbyR
AFL       | ε-free-h   $h^{-1}$   $\cap R$   $\cdot$   $\cup$   $+$ | ε-free-GSM   $GSM^{-1}$    | ε-freeR-SinR
full AFL  | h   $h^{-1}$   $\cap R$   $\cdot$   $\cup$   $*$        | GSM   $GSM^{-1}$           | SinR


TABLE 4. Columns (b), (c) and (d) show the closure properties of the
classes of languages indicated on the same row in Column (a). The
class of languages indicated in a row of Column (a) is, by definition,
the class of languages which enjoys the closure properties listed in the
same row of Column (b). The abbreviations used in this table are
explained in Points (1)-(11) starting on page 227.


Then we consider the class $Subst(C)$ of all languages of the form $L_{D,\sigma}$, for every
possible choice of the language $D \in C$ and the substitution $\sigma$ from $A$ to $C$. Formally,
we have that:


$Subst(C) = \{L_{D,\sigma} \mid D \in C \text{ and for all } a \in A,\ \sigma(a) \in C\}$


DEFINITION 7.6.4. [Closure under Substitution] A class C of languages is
said to be closed under substitution (or Subst, for short) iff $Subst(C) \subseteq C$.
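
The construction of the language $L_{D,\sigma}$ can be tried out on finite languages. The following is a minimal sketch in Python, not part of the original text: the names `concat` and `substitute` and the sample languages are hypothetical, words are represented as Python strings, and we assume that the empty word of D is mapped to {ε} (the case n = 0).

```python
def concat(l1, l2):
    """Language concatenation of two finite languages (sets of strings)."""
    return {u + v for u in l1 for v in l2}


def substitute(d, sigma):
    """Return L_{D,sigma} for a finite language d and finite languages sigma[a]."""
    result = set()
    for word in d:
        image = {""}  # assumption: the empty word is mapped to {ε}
        for symbol in word:
            image = concat(image, sigma[symbol])
        result |= image
    return result


# Hypothetical example over the alphabet {a, b}.
d = {"", "ab", "ba"}
sigma = {"a": {"a", "aa"}, "b": {"", "b"}}
print(sorted(substitute(d, sigma)))
```

Since every language involved here is finite, and hence regular, the computed language illustrates at the same time SinR, SbyR, and Subst.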


We state without proofs the following results. For the notions of: (i) GSM mapping, 
(ii) e-free GSM mapping, and (iii) inverse GSM mapping, the reader may refer to 
Definition 2.11.2 on page 93 and Definition 2.11.3 on page 93. 


In Table 4 we show various closure properties of the families of languages: (i) trio, 
(ii) full trio, (iii) AFL, and (iv) full AFL. These families of languages are indicated 
in Column (a). The properties we have listed in a row of Column (b) hold, by 
definition, for the family of languages indicated in the same row of Column (a). 


In that table we have used the following abbreviations: 


(1) h stands for closure under homomorphism (see Definition 1.7.2 on page 27),

(2) ε-free-h stands for closure under ε-free homomorphism (that is, a
homomorphism h such that h(a) ≠ ε for every symbol a),

(3) $h^{-1}$ stands for closure under inverse homomorphism,

(4) $\cap R$ stands for closure under intersection with regular languages,

(5) GSM stands for closure under GSM mapping,

(6) ε-free-GSM stands for closure under ε-free GSM mapping,

(7) $GSM^{-1}$ stands for closure under inverse GSM mapping,

(8) $\cdot$ stands for closure under language concatenation,

(9) $\cup$ stands for closure under language union,

(10) $+$ stands for closure under + closure (see page 10), and

(11) $*$ stands for closure under * closure (see page 9).

For instance, Table 4 tells us that: (i) any trio is closed under ε-free GSM mapping,
inverse GSM mapping, and ε-freeR-SbyR (see the first row which shows the properties
of trios), and (ii) any full trio is closed under GSM mapping, inverse GSM
mapping, and SbyR (see the second row which shows the properties of full trios).


The result stated for the AFL’s in Column (d) of Table 4 can be slightly improved.
Indeed, it can be shown that:

for each AFL, if in that AFL there exists a language L such that ε ∈ L, then
that AFL is closed under SinR, and not only ε-freeR-SinR [9, Theorem 11.5 on
page 278].


FACT 7.6.5. [Closure Under Substitution of the Classes of Languages
REG, CF, CS, REC, and R.E.] Regular languages (with the empty word ε
allowed), context-free languages (with the empty word ε allowed), context-sensitive
languages (with the empty word ε allowed), recursive sets, and r.e. sets are closed
under Subst (see also [9, page 278]).


In Table 5 on page 229 we have shown some examples of full trios, AFL’s, and
full AFL’s. REG, ε-free REG, LIN, DCF, CF, ε-free CF, CS, ε-free CS, REC, and
R.E. denote, respectively, the class of regular, ε-free regular, linear context-free,
deterministic context-free, context-free, ε-free context-free, context-sensitive, ε-free
context-sensitive, recursive, and recursively enumerable languages.

We already know these classes of languages, except for the e-free classes which 
we will now define. 


DEFINITION 7.6.6. [Epsilon-Free Class of Languages and Epsilon-Free
Language] A class C of languages is said to be ε-free if the empty word ε is not an
element of any language in C. A language L is said to be ε-free if the empty word ε
is not an element of L.


Thus, in particular: (i) ε-free REG is the class of the regular languages L such
that ε ∉ L, (ii) ε-free CF is the class of the context-free languages L such that ε ∉ L,
and (iii) ε-free CS is the class of the context-sensitive languages L such that ε ∉ L.

Recall that we assume that the empty word ε is allowed in the languages of the
classes REG, DCF, CF, CS, REC, and R.E. In particular, we allow the empty word
in the context-sensitive languages (see Definition 1.5.7 on page 21). Note that, on
the contrary, J. E. Hopcroft and J. D. Ullman assume in their book [9] that no
context-sensitive language includes the empty word (see, in particular, [9,
page 271]).


Note that the classes of languages ε-free CS, CS, and REC are not full AFL’s
because they are not closed under homomorphisms. However, they are closed under
ε-free homomorphisms (recall Fact 4.0.11 on page 179).


The class LIN of the linear context-free languages has been introduced in
Definition 3.1.22 on page 110. Now we present an alternative, equivalent definition.


family    | classes of languages
trio      |
full trio | LIN
AFL       | ε-free REG, ε-free CF, ε-free CS, CS, REC
full AFL  | REG, CF, R.E.


TABLE 5. Abstract Families of Languages and their relation to the
Chomsky Hierarchy. The classes REG, ε-free REG, LIN, DCF, CF,
ε-free CF, CS, ε-free CS, REC, and R.E. are, respectively, the classes
of the regular, ε-free regular, linear context-free, deterministic context-free,
context-free, ε-free context-free, context-sensitive, ε-free context-sensitive,
recursive, and recursively enumerable languages.


DEFINITION 7.6.7. [Linear Context-free Language] The class LIN is the class
of the linear context-free languages. A linear context-free language is generated by
a context-free grammar whose productions are of the form:


A → a B
A → B a
A → a


where A and B are nonterminal symbols and a is a terminal symbol. In a grammar
generating a linear context-free language L we also allow the production S → ε
iff ε ∈ L, where S denotes the axiom of the grammar.


The closure properties of the classes of languages shown in Table 5 on page 229 
can be determined by considering also Table 4 on page 227. For instance, we have 
that the class REG of languages is closed under SinR (that is, substitution into 
regular languages) and under SbyR (that is, substitution by regular languages). 
The same holds for the classes CF and R.E. 

The classes ε-free CS, CS, and REC, being AFL’s but not full AFL’s, are
closed under ε-freeR-SinR (that is, substitution into ε-free regular languages) and
ε-freeR-SbyR (that is, substitution by ε-free regular languages).


Note that the class of deterministic context-free languages (DCF) is not a trio. 
The class of deterministic context-free languages is closed under: 


1) complementation, 

2) inverse homomorphism, 

3) intersection with any regular language, 

4) difference with any regular language, that is, if L is a DCF language and
R is a regular language then L − R is a DCF language.


However, the class of deterministic context-free languages is not closed under any 
of the following operations: 

1) ε-free homomorphism (thus, the class DCF of the deterministic context-free
languages is not a trio),

2) concatenation, 

3) union, 

4) intersection, 

5) + closure, 

6) * closure, 

7) substitution, and 

8) reversal. 


FACT 7.6.8. Any class of languages which is an AFL and is closed under
intersection, is also closed under substitution (see Definition 7.6.4 on page 227).


The six closure properties which, by definition, are enjoyed by every AFL,
that is, closure under ε-free homomorphism, inverse homomorphism, intersection
with regular languages, concatenation, union, and + closure, are not all independent.
For instance, we have that concatenation follows from the other five properties.
Analogously, union follows from the other five, and intersection with any regular
language follows from the other five [9, Section 11.5].


7.7. From Finite Automata to Left Linear and Right Linear Grammars 


In this section we will present an algorithm which given any nondeterministic finite 
automaton, derives an equivalent left linear or right linear grammar. 

This algorithm uses techniques for the simplification of context-free grammars
which we have presented in Section 3.5.3 on page 125 (elimination of
ε-productions) and Section 3.5.4 on page 126 (elimination of unit productions).

This algorithm is perfectly symmetric with respect to the left linear case and 
the right linear case and, in that sense, it is better than any of the algorithms we 
have presented in Sections 2.2 and 2.4, that is, (i) Algorithm 2.2.3 on page 34, 
(ii) Algorithm 2.4.5 on page 40, and (iii) Algorithm 2.4.6 on page 41. 


ALGORITHM 7.7.1. 


Procedure: from Finite Automata 
to Right Linear or Left Linear Grammars. 


Input: a deterministic or nondeterministic finite automaton which accepts the
language $L \subseteq \Sigma^*$.

Output: a right linear or a left linear grammar which generates the language L.

If the finite automaton has no final states, then the right linear or the left linear
grammar has an empty set of productions. If the finite automaton has at least one
final state, then we perform the following steps.


Step (1). Add a new initial state S with an ε-arc to the old initial state, which will
no longer be the initial state. Add a new final state F with ε-arcs from the old final
state(s) which will no longer be final state(s).

Step (2). For every arc $A \xrightarrow{a} B$, with $a \in \Sigma \cup \{\varepsilon\}$, add the production:
A → a B   for the right linear grammar.   |   B → A a   for the left linear grammar.


Step (3). The symbol which occurs only on the left of a production, is the axiom,
and the symbol which occurs only on the right of a production, has an ε-production,
that is,


for the right linear grammar:        for the left linear grammar:
take S as the axiom                  take F as the axiom
add F → ε                            add S → ε


Step (4). Eliminate by unfolding the ε-production and the unit productions.


Note 1. If the given automaton has no final states, then the language accepted by 
that automaton is empty, and both the left linear and right linear grammars we 
want to construct, have an empty set of productions. 


Note 2. After the introduction of the new initial state and the new final state, the
initial state is never also a final state. Moreover, no arc goes to the initial state and
no arc departs from the final state. The form of the productions A → a B and
B → A a for the right linear grammar and the left linear grammar, respectively,
can be recalled by thinking of the boxed parts of the following diagrams of the arc
$A \xrightarrow{a} B$:


for the right linear grammar:                for the left linear grammar:
[Diagram: the arc from A to B with label a,  [Diagram: the arc from A to B with label a,
with the part "a B" boxed]                   with the part "A a" boxed]


Note that for the right linear grammar and the left linear grammar, the two symbols
occurring on the right hand side of the production (a B and A a, respectively), are
in the same order in which they occur in the arc $A \xrightarrow{a} B$.


Note 3. We add exactly one production for every arc $A \xrightarrow{a} B$. With reference to
what we have said on page 44, we have that:

(i) for the right linear grammar every state encodes its future until a final state and 
thus, A — a B tells us that the future of A is a followed by the future of B, and 
(ii) for the left linear grammar every state encodes its past from the initial state and 
thus, B — Aa tells us that the past of B is the past of A followed by a. 


Note 4. At Step (3) the choice of the axiom and the addition of the ε-production
make every symbol of the derived grammar a useful symbol.

At Step (3) we add one ε-production only, and that ε-production forces an empty
future of the final state F (for the right linear grammar), and an empty past of the
initial state S (for the left linear grammar).

At the end of Step (3) the grammar may have one or more unit productions. □
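
Steps (1)-(3) of Algorithm 7.7.1 can be sketched in Python as follows. This is only an illustration under stated assumptions and not the author's implementation: the function names, the representation of arcs as triples, the representation of a production as a pair (left hand side, tuple of right hand side symbols), and the sample automaton are all hypothetical. Step (4) is not performed. The helper `words` merely enumerates short words so that the two derived grammars can be compared.

```python
def nfa_to_linear_grammars(arcs, initial, finals):
    """Steps (1)-(3) only. arcs: (source, label, target) triples, where label
    is a one-character terminal or "" for an ε-arc. In a right hand side the
    symbol "" stands for ε. The names "S" and "F" are assumed to be unused."""
    if not finals:
        return [], [], {"S", "F"}  # empty language: no productions
    arcs = list(arcs)
    arcs.append(("S", "", initial))                      # Step (1)
    arcs.extend((f, "", "F") for f in sorted(finals))    # Step (1)
    states = {s for a, _, b in arcs for s in (a, b)}
    right = [(a, (x, b)) for a, x, b in arcs]            # Step (2): A -> x B
    left = [(b, (a, x)) for a, x, b in arcs]             # Step (2): B -> A x
    right.append(("F", ("",)))                           # Step (3): axiom S
    left.append(("S", ("",)))                            # Step (3): axiom F
    return right, left, states


def words(productions, nonterminals, axiom, max_len):
    """Terminal words of length <= max_len derivable from the axiom."""
    seen, todo, result = set(), [(axiom,)], set()
    while todo:
        form = todo.pop()
        if form in seen:
            continue
        seen.add(form)
        idx = next((i for i, s in enumerate(form) if s in nonterminals), None)
        if idx is None:               # no nonterminal left: a generated word
            result.add("".join(form))
            continue
        for lhs, rhs in productions:
            if lhs == form[idx]:
                new = form[:idx] + tuple(s for s in rhs if s) + form[idx + 1:]
                if sum(1 for s in new if s not in nonterminals) <= max_len:
                    todo.append(new)
    return result


# Hypothetical automaton accepting the words over {a, b} which end with ab.
arcs = [("q0", "a", "q0"), ("q0", "b", "q0"), ("q0", "a", "q1"), ("q1", "b", "q2")]
right, left, nts = nfa_to_linear_grammars(arcs, "q0", {"q2"})
print(sorted(words(right, nts, "S", 3)), sorted(words(left, nts, "F", 3)))
```

In this sketch the unit productions arising from ε-arcs are kept, so the grammars are those obtained at the end of Step (3).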

7.8. Context-Free Grammars over Singleton Terminal Alphabets


In this section we show the following result.


THEOREM 7.8.1. If the terminal alphabet of a context-free grammar G is a 
singleton, then the language L(G) generated by the grammar G is a regular language. 


Let us consider a context-free grammar G which, without loss of generality, does
not have ε-productions besides, possibly, the production S → ε. Let us also assume
that the terminal alphabet of G is a singleton.


Let us first recall the Pumping Lemma for context-free languages (see Theo- 
rem 3.11.1 on page 150). 


LEMMA 7.8.2. [Pumping Lemma for Context-Free Languages] For every
context-free grammar G with terminal alphabet $\Sigma$, there exists $n > 0$ such that for
all $z \in L(G)$, if $|z| \geq n$ then there exist $u, v, w, x, y \in \Sigma^*$, such that

(1) $z = uvwxy$,

(2) $vx \neq \varepsilon$,

(3) $|vwx| \leq n$, and

(4) for all $i \geq 0$, $u v^i w x^i y \in L(G)$.


Let us assume that the terminal alphabet of G is the set $\Sigma = \{a\}$ with cardinality 1.
Since $\Sigma$ has cardinality 1, commutativity holds, that is, for all $u, v \in \Sigma^*$, $uv = vu$.


The following lemma easily follows from the above Lemma 7.8.2. 


LEMMA 7.8.3. [Pumping Lemma for a Terminal Alphabet of Cardinality 1]
Given a context-free grammar G with a terminal alphabet $\Sigma$ of cardinality 1,
there exists $n > 0$ such that for all $z \in L(G)$, if $|z| \geq n$ then there exists $p \geq 0$, there
exists q, such that

(1.1) $|z| = p + q$,

(2.1) $q > 0$,

(3.1) there exists m, with $0 \leq m \leq p$, such that $0 < m + q \leq n$, and

(4.1) for all $s \in \Sigma^*$, for all $i \geq 0$, if $|s| = p + iq$ then $s \in L(G)$.


PROOF. The final part of the statement of Lemma 7.8.2 on page 232 can be
rewritten as follows. By commutativity, we can absorb vx into v (note that v and x
are both existentially quantified) and we get:

... there exist $u, v, w, y \in \Sigma^*$, such that

$z = uvwy$,

$v \neq \varepsilon$,

$|vw| \leq n$, and

for all $i \geq 0$, $u v^i w y \in L(G)$.

By commutativity, we can absorb uy into u (note that u and y are both existentially
quantified) and we get:

... there exist $u, v, w \in \Sigma^*$, such that

$z = uvw$,

$v \neq \varepsilon$,

$|vw| \leq n$, and

for all $i \geq 0$, $u v^i w \in L(G)$.

By commutativity we can put the v’s after w, and we get:

... there exist $u, v, w \in \Sigma^*$, such that

$z = uwv$,

$v \neq \varepsilon$,

$|wv| \leq n$, and

for all $i \geq 0$, $u w v^i \in L(G)$.

Let p denote $|uw|$ and q denote $|v|$. By taking the lengths of the words, which are
non-negative integers, we get:

... there exists $p \geq 0$, there exists $q \geq 0$, there exists $w \in \Sigma^*$, such that

(1.1) $|z| = p + q$,

(2.1) $q > 0$,

(3*) $|w| + q \leq n$, and

(4.1) for all $s \in \Sigma^*$, for all $i \geq 0$, if $|s| = p + iq$ then $s \in L(G)$.


By Condition (2.1) we can write ‘there exists q’, instead of ‘there exists $q \geq 0$’. Let
m denote $|w|$. Since $p = |uw|$, we have that $m \leq p$, and since $q > 0$, we can write
$0 < m + q \leq n$, instead of $|w| + q \leq n$. We get:

... there exists $p \geq 0$, there exists q, such that

(1.1) $|z| = p + q$,

(2.1) $q > 0$,

(3.1) there exists m, with $0 \leq m \leq p$, such that $0 < m + q \leq n$, and

(4.1) for all $s \in \Sigma^*$, for all $i \geq 0$, if $|s| = p + iq$ then $s \in L(G)$.


By Condition (3.1) of the above Pumping Lemma 7.8.3 on page 232, we can replace
Condition (2.1) of that lemma by the stronger condition: $0 < q \leq n$.


Let n denote the number whose existence is asserted by the Pumping Lemma 7.8.3.
Let us consider the following two languages which are subsets of L(G):
(i) $L_{<n} = \{w \in L(G) \mid |w| < n\}$ and
(ii) $L_{\geq n} = \{w \in L(G) \mid |w| \geq n\}$.
Obviously, we have that $L(G) = L_{<n} \cup L_{\geq n}$. Since $L_{<n}$ is finite, $L_{<n}$ is a regular
language.
Thus, in order to show that L(G) is a regular language it is enough to show, as we
now do, that also $L_{\geq n}$ is a regular language.
Given any word $z \in L_{\geq n}$, we have that by Lemma 7.8.3, there exist $p_0 \geq 0$ and
$q_0 > 0$ such that $z = a^{p_0 + q_0}$ (take i = 1) and $a^{p_0} \in L(G)$ (take i = 0).
Since $q_0 > 0$ we have that $p_0 < |z|$. Now, if $p_0 \geq n$, starting from $a^{p_0}$, instead of
z, we get that there exist $p_1 \geq 0$ and $q_1 > 0$ such that $a^{p_0} = a^{p_1 + q_1}$ and thus,

$z = a^{(p_1 + q_1) + q_0}$.

In general, there exist $p_0, q_0, p_1, q_1, p_2, q_2, \ldots, p_h, q_h$, and $h \geq 0$, such that:

$z = a^{p_0 + q_0} = a^{(p_1 + q_1) + q_0} = a^{(p_2 + q_2) + q_1 + q_0} = \cdots = a^{(p_h + q_h) + q_{h-1} + \cdots + q_2 + q_1 + q_0}$     (†)

where: (C1) $p_h < n$, and (C2) for all i, with $0 \leq i < h$, we have that $p_i \geq n$.
Note that, when writing Expression (†), we do not insist that all the $q_i$’s are
distinct.


Since for all i, with $0 \leq i \leq h$, we have that $q_i > 0$, it is the case that for any
$z \in L_{\geq n}$, we can always construct an expression of the form (†) satisfying (C1)
and (C2).

Thus, by writing $iq$, instead of the term $q + \ldots + q$ where the summand q occurs
i times, we have that every word $z \in L_{\geq n}$ is of the form:

$a^{p_h + i_0 q_0 + \ldots + i_k q_k}$

for some $k, p_h, i_0, \ldots, i_k, q_0, \ldots, q_k$ such that:
(ℓ0) $0 \leq k$,
(ℓ1) $0 \leq p_h < n$,
(ℓ2) $i_0 > 0, \ldots, i_k > 0$,
(ℓ3) $0 < q_0 \leq n, \ldots, 0 < q_k \leq n$, and
(ℓ4) the values of $q_0, \ldots, q_k$ are all distinct integers and since there are at most n
distinct integers r such that $0 < r \leq n$, we have that $k < n$.
Thus, the language $L_{\geq n}$ is the union of languages each of which is of the form:

$L_{(p_h, q_0, \ldots, q_k)} = \{a^{p_h + i_0 q_0 + \ldots + i_k q_k} \mid 0 \leq k < n,\ 0 \leq p_h < n,\ i_0 > 0, \ldots, i_k > 0,\ 0 < q_0 \leq n, \ldots, 0 < q_k \leq n\} \cap (\{a\}^* - L_{<n})$


Note that $L_{\geq n}$ is a finite union of such languages, because there exists only a finite
number of tuples of the form $(p_h, q_0, \ldots, q_k)$ such that (ℓ0), (ℓ1), (ℓ3), and (ℓ4)
hold.


Note also that for any tuple of the form $(p_h, q_0, \ldots, q_k)$ such that (ℓ0), (ℓ1),
(ℓ3), and (ℓ4) hold, we have that $L_{(p_h, q_0, \ldots, q_k)}$ is a regular language. Indeed,
the finite automaton which recognizes $L_{(p_h, q_0, \ldots, q_k)}$ is as follows:

[Figure: diagram of the finite automaton recognizing $L_{(p_h, q_0, \ldots, q_k)}$; not recoverable from the source.]

By recalling that the class of regular languages is closed under finite union, finite
intersection, and complementation, we get that $L_{\geq n}$ is a regular language.

This concludes the proof that every context-free grammar G over a terminal 
alphabet of cardinality 1 generates a regular language. 


Note that the proof we have given does not require Parikh’s Lemma. In the 
literature there is also a proof based on Parikh’s Lemma (see [8], Sections 6.3, 6.9, 
and Problem 4 on page 231). 


7.9. The Bernstein Theorem 


In this section we present a lattice theoretic proof of the Bernstein Theorem based 
on the following lemma due to Knaster and Tarski whose proof can be found in the 
literature (see, for instance, [16, pages 31-32]). 


LEMMA 7.9.1. [Knaster-Tarski, 1955] Let $T : L \to L$ be a monotonic function
on a complete lattice L ordered by a partial order denoted $\leq$. T has a least fixpoint
which is $glb\{x \mid T(x) = x\}$, that is, $T(glb\{x \mid T(x) = x\}) = glb\{x \mid T(x) = x\}$ (note
that $T(x) = x$ stands for $T(x) \leq x$ and $x \leq T(x)$).


THEOREM 7.9.2. [Bernstein, 1898] Given any two sets X and Y, and two
injections $f : X \to Y$ and $g : Y \to X$, there exists a bijection $h : X \to Y$.


PROOF. Let us consider: (i) the function $f^* : 2^X \to 2^Y$ such that given any set
$A \subseteq X$, $f^*(A) = \{f(x) \mid x \in A\}$, (ii) the function $g^* : 2^Y \to 2^X$ such that given any
set $B \subseteq Y$, $g^*(B) = \{g(y) \mid y \in B\}$, and (iii) the function $c^* : 2^X \to 2^X$ such that
given any set $A \subseteq X$, $c^*(A) = X - g^*(Y - f^*(A))$.

The function $c^*$ is a monotonic function from the complete lattice $(2^X, \subseteq)$ to
itself. Indeed, if $A_1 \subseteq A_2$ then $X - g^*(Y - f^*(A_1)) \subseteq X - g^*(Y - f^*(A_2))$.

Thus, as a consequence of the monotonicity of $c^*$, by Lemma 7.9.1, we have
that there exists a fixpoint, say $\hat{X}$, of $c^*$. Since $\hat{X}$ is a fixpoint, we have that
$\hat{X} = X - g^*(Y - f^*(\hat{X}))$. From this equality we get: $X - \hat{X} = X - (X - g^*(Y - f^*(\hat{X})))$
and since $X - (X - A) = A$ for any set $A \subseteq X$, we get:

$X - \hat{X} = g^*(Y - f^*(\hat{X}))$     (†)

Let us consider the relation $h \subseteq X \times Y$ defined as follows: for any $x \in X$,

$h(x)$ = if $x \in \hat{X}$ then $f(x)$ else $g^{-1}(x)$.

We have that the relation h is a total function from X to Y because: (i) f is a
total function from X to Y, f being an injection from X to Y, and (ii) $g^{-1}$ is a
total function from $X - \hat{X}$ to Y because: (ii.1) g is an injection from Y to X and
(ii.2) $X - \hat{X} \subseteq g^*(Y)$ (this is a consequence of the equality (†) above).

Now we show that h is a bijection from X to Y (see also Figure 7.9.1) by showing
that there exists a relation $k \subseteq Y \times X$ such that:
(1) k is a total function from Y to X,

(2) for any $x \in X$, $k(h(x)) = x$, and
(3) for any $y \in Y$, $h(k(y)) = y$.

FIGURE 7.9.1. Given the two injections $f : X \to Y$ and $g : Y \to X$,
the definition of the bijection $h : X \to Y$ is as follows: for any $x \in X$,
$h(x)$ = if $x \in \hat{X}$ then $f(x)$ else $g^{-1}(x)$, where $\hat{X}$ is a subset of X
such that $\hat{X} = X - g^*(Y - f^*(\hat{X}))$. The functions $f^*$ and $g^*$ denote,
respectively, the pointwise extensions of the injections f and g, in the
sense that $f^*$ and $g^*$ act on sets of elements, rather than on elements,
as the functions f and g do. The function $k : Y \to X$ is the inverse
of the function h.


We claim that k is defined as follows: for any $y \in Y$,

$k(y)$ = if $y \in f^*(\hat{X})$ then $f^{-1}(y)$ else $g(y)$.

Proof of (1). k is the union of two total functions with disjoint domains whose
union is Y. Indeed, (i) $f^{-1}$ is a total function from $f^*(\hat{X})$ to X because f is an
injection from X to Y, and (ii) g is a total function from $Y - f^*(\hat{X})$ to X, because g
is an injection from Y to X.


Proof of (2). Case (2.1) Take any $x \in \hat{X}$. By the definition of h we have that
$h(x) = f(x)$. Thus, we get:

(2.1.1) $k(h(x)) = k(f(x))$.
Now, since $x \in \hat{X}$ we have that $f(x) \in f^*(\hat{X})$, and by the definition of k we have
that:

(2.1.2) $k(f(x)) = f^{-1}(f(x))$.
From Equations (2.1.1) and (2.1.2), by transitivity, we get: $k(h(x)) = f^{-1}(f(x))$,
and from this last equation, since f is an injection, we get: $k(h(x)) = x$.
Case (2.2) Take any $x \notin \hat{X}$. By the definition of h we have that $h(x) = g^{-1}(x)$.
Thus, we get:

(2.2.1) $k(h(x)) = k(g^{-1}(x))$.
Now, since $x \notin \hat{X}$ we have that $g^{-1}(x) \notin f^*(\hat{X})$ (by (†)), and by the definition of k
we have that:

(2.2.2) $k(g^{-1}(x)) = g(g^{-1}(x))$.
From Equations (2.2.1) and (2.2.2), by transitivity, we get: $k(h(x)) = g(g^{-1}(x))$,
and from this last equation, since g is an injection, we get: $k(h(x)) = x$.


Proof of (3). Case (3.1) Take any $y \in f^*(\hat{X})$. By the definition of k we have that
$k(y) = f^{-1}(y)$. Thus, we get:

(3.1.1) $h(k(y)) = h(f^{-1}(y))$.
Now, since $y \in f^*(\hat{X})$ we have that $f^{-1}(y) \in \hat{X}$, and by the definition of h we have
that:

(3.1.2) $h(f^{-1}(y)) = f(f^{-1}(y))$.
From Equations (3.1.1) and (3.1.2), by transitivity, we get: $h(k(y)) = f(f^{-1}(y))$,
and from this last equation, since f is an injection, we get: $h(k(y)) = y$.
Case (3.2) Take any $y \notin f^*(\hat{X})$. By the definition of k we have that $k(y) = g(y)$.
Thus, we get:

(3.2.1) $h(k(y)) = h(g(y))$.
Now, since $y \notin f^*(\hat{X})$ we have that $g(y) \notin \hat{X}$ (by (†)), and by the definition of h we
have that:

(3.2.2) $h(g(y)) = g^{-1}(g(y))$.
From Equations (3.2.1) and (3.2.2), by transitivity, we get: $h(k(y)) = g^{-1}(g(y))$,
and from this last equation, since g is an injection, we get: $h(k(y)) = y$.
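
The construction used in the proof of the Bernstein Theorem can be tried out on a concrete pair of injections. The following Python sketch is only an illustration: the choice X = Y = N, f(x) = 2x, and g(y) = 2y + 1, and all function names, are hypothetical and do not come from the text. For this choice we rely on the observation that an element x belongs to the least fixpoint of $c^*$ exactly when tracing x backwards, alternately through $g^{-1}$ and $f^{-1}$, ends at an element of X which is not in $g^*(Y)$.

```python
def f(x):
    """Hypothetical injection from X = N to Y = N."""
    return 2 * x


def g(y):
    """Hypothetical injection from Y = N to X = N."""
    return 2 * y + 1


def in_fixpoint(x):
    """Membership of x in the least fixpoint of c*(A) = X - g*(Y - f*(A))."""
    while True:
        if x % 2 == 0:        # x is not in g*(Y): the backward chain ends in X
            return True
        y = (x - 1) // 2      # y is the inverse image of x under g
        if y % 2 == 1:        # y is not in f*(X): the backward chain ends in Y
            return False
        x = y // 2            # inverse image of y under f, smaller than before


def h(x):
    """h(x) = f(x) on the fixpoint, and the inverse of g elsewhere."""
    return f(x) if in_fixpoint(x) else (x - 1) // 2


def k(y):
    """The inverse of f on the image of the fixpoint, and g elsewhere."""
    if y % 2 == 0 and in_fixpoint(y // 2):
        return y // 2
    return g(y)


print([h(x) for x in range(8)])
print(all(k(h(x)) == x and h(k(x)) == x for x in range(1000)))
```

The loop in `in_fixpoint` terminates because the value of x strictly decreases at every iteration.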


7.10. Existence of Functions That Are Not Computable 


In this section we will show that there exist functions from the set of natural numbers
to the set of natural numbers which are not Turing computable, that is, computable
by a Turing Machine. We will not define this concept here and we refer to books on
Computability Theory such as, for instance, [7, 18]. For reasons of simplicity, we
will also say ‘computable’, instead of ‘Turing computable’.


Let us first recall a few notational conventions. 


(i) N denotes the set of natural numbers {0,1,2,...},
(ii) $R_{(0,1)}$ denotes the set of reals in the open interval (0,1) with 0 and 1
excluded,
(iii) $R_{(-\infty,+\infty)}$ denotes the set of all reals, also denoted by R,
(iv) Prog denotes the set of all programs, written in Pascal or C++ or Java or
any other programming language in which one can write any computable
function,
(v) $x^\omega$ denotes the infinite sequence of x’s.


We stipulate that, given any two sets A and B: 

(vi) $|A| = |B|$ means that there exists a bijection between A and B,
(vii) $|A| \leq |B|$ means that there exists an injection from A to B,
(viii) $|A| < |B|$ means that there exists an injection from A to B and
there is no bijection from A to B.

In what follows we will make use of the Bernstein Theorem (see Theorem 7.9.2 on
page 235), that is, if $|A| \leq |B|$ and $|B| \leq |A|$ then $|A| = |B|$.

We begin by proving the following theorems.

THEOREM 7.10.1. We have the following facts:
(i) $|N| = |N \cup \{a\}|$ for any $a \notin N$
(ii) $|N| = |N \times N|$
(iii) $|N| = |N^*|$
(iv) $|N \to \{0,1\}| = |N \to N|$
(v) $|\{0,1\}^*| = |N|$

PROOF. (i) We apply the Bernstein Theorem. The two injections which are required
are: the injection from N to $N \cup \{a\}$ which maps n to n, for any $n \in N$, and the
injection from $N \cup \{a\}$ to N which maps a to 0 and n to n+1, for any $n \in N$.


(ii) We apply the Bernstein Theorem. The injection $\delta$ from N to $N \times N$ is defined
as follows. We stipulate that:

$s = \left\lfloor \frac{\sqrt{8z+1} - 1}{2} \right\rfloor$     for any $z \in N$,

where $\lfloor x \rfloor$ denotes the largest natural number less than or equal to x.

We also stipulate that:

$n = z - \frac{s^2 + s}{2}$

Then for any $z \in N$, we define $\delta(z)$ to be $(n, s-n)$. The injection $\pi$ from $N \times N$ to
N is defined as follows:

for any $n, m \in N$,   $\pi(n, m) = \frac{(n+m)^2 + 3n + m}{2}$

We leave it to the reader to show that $\pi$ is the inverse of $\delta$ and vice versa.

Figure 7.10.1 shows the bijection $\delta$ between N and $N \times N$. The function $\delta$ is
called the dove-tailing bijection.
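
The two functions can be checked mechanically. The following Python sketch is an illustration only: the names `delta` and `pi` are ours, and the integer square root `math.isqrt` replaces the real square root in the definition of s, which gives the same value of the floor.

```python
from math import isqrt


def delta(z):
    """Map the natural number z to the pair (n, s - n)."""
    s = (isqrt(8 * z + 1) - 1) // 2   # index of the diagonal containing z
    n = z - (s * s + s) // 2          # position of z along that diagonal
    return n, s - n


def pi(n, m):
    """Map the pair (n, m) to the natural number ((n+m)^2 + 3n + m) / 2."""
    return ((n + m) ** 2 + 3 * n + m) // 2


print(delta(18))
print(all(pi(*delta(z)) == z for z in range(10000)))
```

The numerator of `pi` is always even, because $(n+m)^2 + (n+m)$ is even, so the integer division is exact.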


        m=0   1    2    3    4    5
n=0      0    1    3    6   10   15
n=1      2    4    7   11   16
n=2      5    8   12   17
n=3      9   13   18
n=4     14   19
n=5     20


FIGURE 7.10.1. The dove-tailing bijection $\delta$ between N and $N \times N$.
For instance, $\delta(18) = (3, 2)$.


(iii) We can construct a bijection between N and $N \times N \times N$ by using twice the bijection
between N and $N \times N$. Thus, by induction, we get that, for any k = 1,2,..., there
exists a bijection between N and $N^k$. Then, the bijection between N and $N^*$ can be
constructed by considering a table, call it A, like that of Figure 7.10.1, where in the
first row, for n=0, we have the elements of N, in the second row, for n=1, we have
the elements of $N \times N$ ordered according to the bijection $\delta$ between N and $N^2$, and
in the generic row, for n=k, we have the elements of $N^{k+1}$, ordered according to the
bijection between N and $N^{k+1}$. Table A gives a bijection between N and $\bigcup_{k>0} N^k$,
also denoted $N^+$, by applying the dove-tailing bijection as in Figure 7.10.1. To
get the bijection between N and $\bigcup_{k \geq 0} N^k$, also denoted $N^*$, it is enough to recall
Point (i) above because $N^0$ is a singleton.


(iv) Since $N \to \{0,1\}$ is a subset of $N \to N$, by the Bernstein Theorem, it is enough
to construct an injection h from $N \to N$ to $N \to \{0,1\}$. Each function f in $N \to N$
can be viewed as a 2-dimensional matrix M, like the one in Figure 7.10.1 above,
where for $i, j \geq 0$, $M(i,j) = 1$ iff $f(i) \leq j$ and $M(i,j) = 0$ iff $f(i) > j$. The matrix M
provides the unary representation of the value of f(i), for all $i \in N$. Then, h(f) is
the function in $N \to \{0,1\}$, such that for each $n \in N$, $(h(f))(n) = M(\delta(n))$. By
construction, h is an injection.


(v) We apply the Bernstein Theorem. The injection from {0,1}* to N is obtained by 
adding 1 to the left of any given sequence in {0,1}* and considering the correspond- 
ing natural number. The injection from N to {0,1}* is obtained by considering the 
binary representation of any given natural number. 
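
The two injections of Point (v) can be sketched in Python as follows. The names `seq_to_nat` and `nat_to_seq` are ours, and binary sequences are represented as Python strings over the characters 0 and 1.

```python
def seq_to_nat(bits):
    """Injection from {0,1}* to N: add a 1 to the left, then read in binary."""
    return int("1" + bits, 2)


def nat_to_seq(n):
    """Injection from N to {0,1}*: the binary representation of n."""
    return format(n, "b")


print(seq_to_nat(""), seq_to_nat("0"), seq_to_nat("00"), seq_to_nat("001"))
print(nat_to_seq(0), nat_to_seq(5))
```

The leading 1 is what makes the first function injective: without it, the sequences 0 and 00 would both be mapped to the number 0.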


THEOREM 7.10.2. [Cantor Theorem] For any set A, we have that $|A| < |2^A|$.


PROOF. An injection from A to $2^A$ is the function which for any $a \in A$, maps a to
{a}. It remains to show that there is no bijection between A and $2^A$. The proof is
by contradiction. Let us assume that there is a bijection $g : A \to 2^A$. Let us consider
the set $X = \{a \mid a \in A \text{ and } a \notin g(a)\}$. Thus, $X \subseteq A$. Since g is a bijection there
exists y in A such that $g(y) = X$. Now, if we suppose that $y \in X$ we get that
$y \in g(y)$ and thus, $y \notin g(y)$. If we suppose that $y \notin X$, we get that $y \notin g(y)$ and
thus, $y \in g(y)$. This is a contradiction.
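
For a small finite set the diagonal argument can be checked exhaustively. The following Python sketch is an illustration only: the names `diagonal_set` and `powerset` and the choice A = {0, 1, 2} are hypothetical. It enumerates every function g from A to $2^A$ and verifies that the set $X = \{a \mid a \in A \text{ and } a \notin g(a)\}$ is never in the image of g.

```python
from itertools import combinations, product


def diagonal_set(a, g):
    """The set X of the proof: the elements x of a such that x is not in g[x]."""
    return frozenset(x for x in a if x not in g[x])


def powerset(a):
    """All subsets of the finite set a, as frozensets."""
    items = sorted(a)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


a = {0, 1, 2}
subsets = powerset(a)
# Every function g from a to its powerset misses the diagonal set.
missed_every_time = all(
    diagonal_set(a, dict(zip(sorted(a), images))) not in images
    for images in product(subsets, repeat=len(a))
)
print(missed_every_time)
```

The check covers all 512 functions from A to $2^A$, so in particular no such function is onto.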


Now we prove the following facts:


$|N| \overset{(T1)}{=} |Prog| \overset{(T2)}{<} |N \to \{0,1\}| \overset{(T3)}{=} |2^N| \overset{(T4)}{=} |R_{(0,1)}| \overset{(T5)}{=} |R_{(-\infty,+\infty)}|$


THEOREM 7.10.3. (T1): $|N| = |Prog|$.


PROOF. We apply the Bernstein Theorem. The injection from N to Prog is as
follows. For any $n \geq 0$, we consider the Pascal program:


program num;
var x: integer;
begin x := 0; ... x := 0; end


where the statement x := 0 occurs n times. The injection from Prog to N is as
follows. Given a program P in Prog as a sequence of characters, consider the
ASCII code of each character. We get a sequence of bits. By adding a 1 to the left
of that sequence we get the binary representation of a natural number.
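
Both injections can be sketched in Python. The following is an illustration under stated assumptions: the names `program_for`, `encode`, and `decode` are ours, the program text is written on a single line, and every character is assumed to be an ASCII character, which we represent by 8 bits.

```python
def program_for(n):
    """The text of the Pascal program in which x := 0 occurs n times."""
    body = " ".join("x := 0;" for _ in range(n))
    return f"program num; var x: integer; begin {body} end"


def encode(program_text):
    """Injection from program texts to N (assumption: 8 bits per character)."""
    bits = "".join(format(ord(ch), "08b") for ch in program_text)
    return int("1" + bits, 2)


def decode(number):
    """Recover the program text from its code by dropping the leading 1."""
    bits = format(number, "b")[1:]
    return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, len(bits), 8))


p = program_for(2)
print(p)
print(decode(encode(p)) == p)
```

The existence of `decode` shows that `encode` is an injection: two distinct program texts cannot have the same code.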


THEOREM 7.10.4. (T2) and (T3): $|N| < |N \to \{0,1\}| = |2^N|$.

PROOF. (T2) holds because of the Cantor Theorem. (T3) holds because a bijection
between $N \to \{0,1\}$ and $2^N$ is obtained by mapping an element of N to 1 iff it
belongs to the given subset of N in $2^N$.


As a consequence of $|N| = |Prog|$ and $|N| < |N \to \{0,1\}|$, we have that there
are functions from N to {0,1} which do not have their corresponding programs
written in Pascal or C++ or Java or any other programming language in which one
can write any computable function. Thus, there are functions from N to N which
are not computable.


From Theorem 7.10.1 we have that:
(i) $|N| = |N \times N|$, and (ii) $|N \to \{0,1\}| = |N \to N|$.
Thus, $|Prog| < |(N \times N) \to N|$.


Now we present a particular total function, called decide, from N xN to {true, 
false} for which there is no program that always halts and computes the value of 
decide(m, n) for all inputs m and n. 

Note that if we encode true by 1 and false by 0, the function decide can be 
viewed as a function from N x N to N. The function decide is the one that given 
a program prog (as a finite sequence of characters) and a value inp (as a finite 
sequence of characters), tells us whether or not prog halts for the input inp. (Recall
that by Property (v) of Theorem 7.10.1 above, a finite sequence of characters
can be encoded by a natural number.) We will assume that whenever the number
m is the encoding of a sequence of characters which is not a legal program, then 
m is the encoding of the program that halts for all inputs. We also assume that: 
(i) decide(m,n) = true iff the program which is encoded by m terminates for the 
input encoded by n, and (ii) decide(m,n) = false iff the program which is encoded 
by m does not terminate for the input encoded by n. 


In order to show that for the function decide there is no program that always 
halts and computes the value of decide(m, n) for all inputs m and n, we will reason 
by contradiction. Let us assume that the function decide can be computed by the
following Pascal-like program, called Decide, that always halts.


function decide(prog, inp: text): boolean;            Program Decide
begin ... end


Thus, decide(prog, inp) is true iff prog(inp) terminates, and decide(prog, inp) is false
iff prog(inp) does not terminate. If program Decide exists, then there also exists the
following program that always halts:


Program SelfDecide 

   function selfdecide(prog: text): boolean; 
   var inp: text; 
   begin inp := prog; selfdecide := decide(prog, inp) end 


This program tests whether or not the program prog halts for the input sequence 
of characters which is prog itself. Thus, selfdecide(prog) is true iff prog(prog) 
terminates, and selfdecide(prog) is false iff prog(prog) does not terminate. Now, if 
the program Decide exists, then also the following program exists: 


Program SelfDecideLoop 

   function selfdecideloop(prog: text): boolean; 
   var x: integer; 
   begin if selfdecide(prog) then while true do x := 0 
         else selfdecideloop := false 
   end 


Now, since the program SelfDecide always halts, the value of selfdecide(prog) is 
either true or false, and we have that: 


selfdecideloop(prog) does not terminate iff prog(prog) terminates.      (††) 

Now, if we consider the execution of the call selfdecideloop(selfdecideloop), we have 
that: 


selfdecideloop(selfdecideloop) does not terminate iff 
selfdecideloop(selfdecideloop) terminates. 


This contradiction is derived by instantiating Property (††) for prog equal to 
selfdecideloop. Thus, since all program construction steps from the initial program 
Decide are valid program construction steps, we conclude that a program Decide that 
always halts does not exist. 
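The construction above can be replayed against any concrete candidate for decide. The following sketch does so under two simplifying assumptions of ours: programs are represented by Python functions rather than by texts, and the candidate decider is deterministic. The names make_selfdecideloop and refute are hypothetical helpers introduced only for this example.

```python
from typing import Callable

Decider = Callable[[Callable, Callable], bool]


def make_selfdecideloop(decide: Decider) -> Callable:
    """Build SelfDecideLoop from a candidate halting decider."""
    def selfdecideloop(prog):
        if decide(prog, prog):     # plays the role of selfdecide(prog)
            while True:            # diverge on purpose
                pass
        return False
    return selfdecideloop


def refute(decide: Decider) -> str:
    """Show that the candidate decider is wrong on selfdecideloop itself."""
    sdl = make_selfdecideloop(decide)
    if decide(sdl, sdl):
        # The candidate claims that sdl(sdl) halts, but with this verdict
        # sdl(sdl) enters the infinite loop. We do not run it.
        return "claimed halting, but sdl(sdl) loops forever"
    # The candidate claims that sdl(sdl) diverges; running it is safe,
    # because with this verdict it returns immediately.
    assert sdl(sdl) is False
    return "claimed divergence, but sdl(sdl) halted"
```

Whatever answer the candidate gives for the pair (selfdecideloop, selfdecideloop), the behaviour of selfdecideloop is the opposite one, so the candidate does not compute decide.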


We also have the following theorems. 


THEOREM 7.10.5. (T4): |2^N| = |R_(0,1)|. 
PROOF. We apply the Bernstein Theorem. The injection from R_(0,1) to 2^N is 
obtained by considering the binary representation of each element in R_(0,1). In the 
binary representations we assume that the binary point is at the left, that is, the 
most significant bit is the leftmost one. Moreover, in the binary representations 
we identify a sequence of the form σ01^ω, where σ is a finite binary sequence, with 
the sequence σ10^ω because they represent the same real number. (Recall that the 
same identifications are done in the decimal notation where, for instance, the infinite 
strings 5.169^ω and 5.170^ω are assumed to represent the same real number.) Thus, for 
instance, 0101^ω is the binary representation of the real number 0.375 when written 
in the decimal notation. (Indeed, in that infinite string the leftmost 1 corresponds 
to 0.250 and 01^ω corresponds to 0.125.) 

The injection from 2^N to R_(0,1) ∪ N is obtained by considering that the infinite 
sequences of 0’s and 1’s are either binary representations of real numbers or sequences 
of the form σ10^ω, where σ is any finite binary sequence, and by Theorem 7.10.1, there 
are |N| such finite binary sequences. It remains to show that |R_(0,1) ∪ N| = |R_(0,1)|. 
The injection from R_(0,1) to R_(0,1) ∪ N is obvious. The injection from R_(0,1) ∪ N to 
R_(0,1) is obtained by injecting R_(0,1) ∪ N into R_(−∞,+∞) and then injecting R_(−∞,+∞) 
into R_(0,1) (see the proof of Theorem 7.10.6 below). □ 
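The identification of σ01^ω with σ10^ω used in the proof above can be checked with exact arithmetic. The following sketch computes the value of a binary expansion consisting of a finite prefix followed by a bit repeated forever; the function name value and its parameters are ours.

```python
from fractions import Fraction


def value(prefix: str, tail_bit: int) -> Fraction:
    """Value of the expansion 0.prefix followed by tail_bit repeated forever."""
    total = sum(Fraction(int(b), 2 ** (k + 1)) for k, b in enumerate(prefix))
    # The infinite tail b b b ... placed after len(prefix) bits sums to
    # b * 2^(-len(prefix)), by the geometric series.
    return total + Fraction(tail_bit, 2 ** len(prefix))
```

With this function, value("010", 1), that is 0101^ω, and value("011", 0), that is 0110^ω, are both equal to 3/8 = 0.375.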


THEOREM 7.10.6. (T5): |R_(0,1)| = |R_(−∞,+∞)|. 


PROOF. The bijection between R_(0,1) and R_(−∞,+∞) is the composition of the 
following functions: (i) λx. e^x from R_(−∞,+∞) to R_(0,+∞), (ii) λx. arctg(x) from 
R_(0,+∞) to R_(0,π/2), and (iii) λx. (2x/π) from R_(0,π/2) to R_(0,1). 
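The composition of the three functions and its inverse can be written down directly. The sketch below uses floating-point numbers, so it only approximates the real-valued bijection (in particular, math.exp overflows for very large arguments); the names to_unit and from_unit are ours.

```python
import math


def to_unit(x: float) -> float:
    """Compose (i), (ii), (iii): maps a real x to a point of (0,1)."""
    return (2 / math.pi) * math.atan(math.exp(x))


def from_unit(y: float) -> float:
    """Inverse map: undo (iii), then (ii), then (i)."""
    return math.log(math.tan(math.pi * y / 2))
```

For instance, to_unit(0.0) is 0.5 up to rounding, because e^0 = 1 and arctg(1) = π/4.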


In the proof of the following theorem we provide a direct proof of the fact that 
there is no bijection between N and R_(−∞,+∞): 


THEOREM 7.10.7. |N| < |R_(−∞,+∞)|. 


PROOF. Since |N| ≤ |R_(−∞,+∞)| and |R_(0,1)| = |R_(−∞,+∞)|, it is enough to show 
that there is no bijection between N and R_(0,1). We prove this fact by contradiction. 
Let us assume that there is a bijection between N and R_(0,1), that is, there is a 
listing of all the reals in R_(0,1). This listing can be represented as a 2-dimensional 
matrix T with 0’s and 1’s of the form: 


For n, m ≥ 0, in row n and column m, we put the m-th bit of the binary representation 
of the n-th real number r_n of that listing (in the above matrix T we have assumed that 
the m-th bit of the binary representation of r_n is 1). 
Now we construct a real number, say d, in the open interval (0,1) which is not 
in that listing. Thus, the listing is not complete (that is, it is not a bijection) and 
we get the desired contradiction. We construct the infinite binary representation of 
d, that is, the sequence d_0 d_1 ... d_i ... of the bits of d where d_0 is the most significant 
bit, as indicated by the following Procedure Diag1: 


Procedure Diag1 

   i := 0; 
   nextone := 0; 
   while i ≥ 0 do  if T(i,i)=0 then d_i := 1; 
                   if T(i,i)=1 then 
                      if i ≤ nextone then d_i := 0 
                      else begin d_i := 1; nextone := next(i); end 
                   i := i+1; 
   od 


where next(i) computes any value of j, with j > i, such that T(i,j)=1. Obviously, 
we can choose j to be the smallest such value, so that next is a function. 
The correctness of Procedure Diag1, which generates the binary representation of 
a real number d in (0,1) which is not in the given listing, derives from the following 
facts: 
(i) no binary representation of a real number in (0,1) is of the form σ0^ω, where σ 
is a finite binary sequence of 0’s and 1’s, and thus, for any given i ≥ 0, next(i) is 
always defined, and 
(ii) the above Procedure Diag1 is an enhancement, in the sense that we will explain 
below, of the following Procedure Diag0: 


Procedure Diag0 

   i := 0; 
   while i ≥ 0 do  if T(i,i)=0 then d_i := 1; 
                   if T(i,i)=1 then d_i := 0; 
                   i := i+1; 
   od 


which constructs the infinite binary representation of d by taking the diagonal of the 
matrix T and interchanging 0’s and 1’s. The real number d is not in the listing because 
it differs from every number in the listing in at least one bit. 
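Procedure Diag0 never terminates, because it produces an infinite sequence of bits. Its effect can nevertheless be observed on a finite upper-left portion of the matrix T, as in the following sketch. The 4 × 4 matrix below is made up for the example, and the function name diag0 is ours.

```python
def diag0(T: list[list[int]]) -> list[int]:
    """Flip the diagonal of a square 0/1 matrix (finite prefix of Diag0)."""
    return [1 - T[i][i] for i in range(len(T))]


# Hypothetical first four bits of the first four reals of a listing.
T = [
    [0, 1, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
]

d = diag0(T)   # the diagonal of T is 0, 1, 1, 0
```

Here d is 1, 0, 0, 1 and, for every i, the i-th bit of d differs from the i-th bit of row i, so d is not equal to any row of T.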


In order to construct the binary representation of the real number d we have to 
use Procedure Diag1, instead of Procedure Diag0, because we have to make sure 
that, as required by our conventions, the binary representation of d is not of the 
form σ0^ω, for some finite binary sequence σ, that is, it does not end with an infinite 
sequence of 0’s. 
Indeed, in order to get a binary representation of the form σ0^ω by using 
Procedure Diag0, we need that for some k ≥ 0, for all h ≥ k, T(h,h)=1. In this case let 
us consider the following portion of the matrix T: 


where: (i) i>h, (ii) j is a bit position greater than i, such that the j-th bit of r_i is 
1 (recall that next(i) is always defined), and (iii) the j-th bit of r_j is 1. 

Then Procedure Diag1, which behaves differently from Procedure Diag0, generates 
d_i = 1 in position (i,i) and d_j = 0 in position (j,j). This makes d different 
from both r_i and r_j in the j-th bit. Thus, after applying Procedure Diag1, we get 
a new value T1 of the matrix T of the following form: 


T1: 


Moreover, the fact that d_i = 1 ensures that the binary representation of d does not 
end with an infinite sequence of all 0’s, as desired. In particular, we have that the 
real number d is different from 0. 

Finally, in order to show that d ∈ R_(0,1), it remains to show that d is different 
from 1. Indeed, this is the case if we assume that the initial value of the matrix T, 
which represents the chosen bijection between N and R_(0,1), satisfies the following 
property: 

there exists i ≥ 0 such that T(i,i)=1.      (α) 

In this case, in fact, at least one bit of d is 0 and thus, d is different from 1. 
Now, without loss of generality, we may assume that Property (α) holds, 
because the existence of a bijection between N and R_(0,1) which is represented by a 
matrix T which does not satisfy Property (α), implies the existence of a different 
bijection between N and R_(0,1) which is represented by a matrix which does satisfy 
Property (α). 

This implication is a consequence of the following two facts: 

(i) in any matrix which represents a bijection between N and R_(0,1), every row has 
at least one occurrence of the bit 1, and 

(ii) in any matrix which represents a bijection between N and R_(0,1), we can permute 
two of its rows so that in the derived matrix, which represents a different bijection 
between N and R_(0,1), we have that, for some i ≥ 0, the bit in row i and column i 
is 1. 
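One way of choosing the two rows to be permuted in Fact (ii) is sketched below on a finite square portion of the matrix. This is our own illustration, not necessarily the permutation the proof has in mind: if no diagonal bit is 1, we take a column c in which row 0 has the bit 1 (it exists by Fact (i)) and we swap rows 0 and c. The function name ensure_alpha is hypothetical, and we assume that such a column c falls inside the finite portion considered.

```python
def ensure_alpha(T: list[list[int]]) -> list[list[int]]:
    """Return a copy of T with a 1 on the diagonal, swapping at most two rows."""
    rows = [row[:] for row in T]
    if any(rows[i][i] == 1 for i in range(len(rows))):
        return rows                       # Property (alpha) already holds
    c = rows[0].index(1)                  # row 0 has a 1 in column c, c != 0
    rows[0], rows[c] = rows[c], rows[0]   # old row 0 becomes row c
    return rows


# Hypothetical 3 x 3 portion whose diagonal bits are all 0.
T = [
    [0, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
]
T_alpha = ensure_alpha(T)
```

After the swap, the old row 0 sits in row c and has the bit 1 in column c, so the bit in row c and column c is 1, while the set of rows, and hence the set of listed reals, is unchanged.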